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| ?g- F.X | PCl-Ex | PCl-Ex | PCI -Ex | PCl-Ex| PCI -E»| FC1-E* 
[Figure: server PCI-Ex bridges connected to NextIO Switches]

Figure 1.3-1: 8-Server Chassis Example 



1.4.2 MAC Features 



• PCI Express V1.0a compliant

• Can be configured to operate as a Shared Port or a Non-Shared Port.

• Layered architecture with Configuration Block, Presentation Layer, Transaction
Layer, Data Link Layer, and Logical Physical Layer.

• Supports x1, x2, x4, and x8 links

• Supports up to 16 OS Domains and up to 8 Virtual Channels with a maximum of
16 different OSD/VC combinations.

• Provides a 64-bit data path at 250MHz

• Supports configurable maximum packet sizes in the range of 128B to 1KB

• Performs PHY level link negotiations

• The MAC is divided into 8 sub-modules, where each sub-module consists of 8
physical lanes that can be negotiated to operate as 1 x8 link, or 2 links of x1, x2,
or x4 lane widths

• 64KB of Receive Data Buffer and 6144 Bytes of Header Buffer (256 outstanding
transactions per port)

• Supports cannibalization of Receive Data Buffers and Receive Header Buffers.
An x8 link can use all 128KB of Receive Data Buffer and all 512 Header Credits.

• 8-bit Single Error Correction and Double Error Detection (SECDED) on Rx Data
& Header Buffers

• Maintains Transaction Ordering within each egress port per OSD/

• Add Power States supported

• Add latency statements (mention 'fast path' feature)

1.4.3 Switch Core Features 

The switch core is responsible for forwarding transaction packet data from the 16 RX
MAC interfaces over to the 16 TX MAC interfaces, and will be implemented as a data
crossbar and a transaction scheduler. The high level functions of the switch core are:

• Implements a transaction scheduler that supports programmable fairness at the
different arbitration levels (per input [illegible], per output port per VC) and that
transfers one transaction per clock from a source to a destination

• Provides a virtual output queue (VOQ) structure such that each input feeds its own
virtual switch from a software configuration view

• Implements a data crossbar that efficiently moves the packet data from the 16 RX
MACs over to the 16 TX MACs

• Provides an interface to the switch management logic for transactions that terminate
in our device

• Lookup interface for P2P bridge routing tables (support for address routing, ID-based
bus routing, and implicit routing)




2 NEXSIS Macro Architecture

2.1 Detailed Block Diagram

2.1.1 Chip Level Block Diagram 

2.1.2 Block Descriptions 

2.2 System View of Chip 

2.2.1 PCI-Bus enumeration - Prashantha 

2.2.2 Transaction Routing 

According to the PCI Express Base Spec (1.0a), there are three types of routing:
address, ID, or implicit. The type of routing used is dependent on the TLP
header (and the routing sub-field, r[2:0], of the Type field for message transactions).
Each routing type will be covered in the next few sections.

In PCI Express, each switch is logically a set of virtual P2P bridges connected as shown
below.




[Figure: switch drawn as virtual P2P bridges and virtual PCI buses (Virtual PCI Bus 3,
4, and 6 legible), with Port 3, Port 4, a PCI-Express/PCI bridge, and Endpoints]

Figure 2.2-1:

This topology is a simplified version of our switch since it only shows 3 ports instead of
16, but it illustrates the idea by having 4 virtual P2P bridges, and 5 PCI buses, numbered
2 through 6. This topology will be used to show examples of how our routing will work
and also shows where our on-chip device, the Common On-chip Processor (COP), will be
located in the topology once discovery is completed by the root complex.



2.2.3 P2P bridges 

A P2P bridge forwards memory and I/O transfers from one PCI bus over to the other side
of the bridge where another PCI bus resides. Each P2P bridge has a 256B header register
space that is accessible by the system and is used to set up all the parameters for the
bridge. The following diagram shows the bridge header (also called a Type 1 header)
with the fields highlighted that apply to transaction routing.




Figure 2.2-2: 

All of these registers are documented in the P2P Bridge spec (pp. 26-54), with some
changes made in PCI Express (base spec 1.0a, section 7.5.3, pp. 327-330). The addressing
starts at the top right of the table as register address 0x0 (lower 8 bits of Vendor ID) and
continues to the bottom left as register 0x3F (upper 8 bits of Bridge Control).
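
Since the header diagram itself is not reproduced here, the routing-related fields can be summarized in a short sketch. The byte offsets below are the standard Type 1 header offsets from the PCI-to-PCI bridge specification; the dictionary name and the read_field helper are hypothetical names chosen for illustration, not part of the design.

```python
# Byte offset and size of the routing-related fields in a Type 1
# (P2P bridge) configuration header. The offsets follow the standard
# PCI-to-PCI bridge layout; the names are illustrative only.
TYPE1_ROUTING_FIELDS = {
    "primary_bus": (0x18, 1),
    "secondary_bus": (0x19, 1),
    "subordinate_bus": (0x1A, 1),
    "io_base": (0x1C, 1),
    "io_limit": (0x1D, 1),
    "memory_base": (0x20, 2),
    "memory_limit": (0x22, 2),
    "prefetchable_base": (0x24, 2),
    "prefetchable_limit": (0x26, 2),
    "prefetchable_base_upper32": (0x28, 4),
    "prefetchable_limit_upper32": (0x2C, 4),
    "io_base_upper16": (0x30, 2),
    "io_limit_upper16": (0x32, 2),
}


def read_field(header: bytes, name: str) -> int:
    """Return one routing field from a raw header image (little-endian)."""
    offset, size = TYPE1_ROUTING_FIELDS[name]
    return int.from_bytes(header[offset:offset + size], "little")
```

A lookup such as read_field(header, "secondary_bus") then returns the bus number used by the ID routing described later in this chapter.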



2.2.4 Address routing 

Address routing is used for memory (32-bit and 64-bit) and I/O transactions that must 
pass through our switch. Each will be discussed in detail in the next few sections. 

The header fields important to address routing are shown in the next two diagrams. 

Figure 2.2-3: Header format for 64-bit Transactions (prefetchable memory)




Figure 2.2-4: Header format for 32-bit Transactions (memory and I/O)




2.2.4.1.1 Memory transactions 

In P2P bridges, there are 2 different memory ranges that can be defined, the 32-bit
memory range (which is required) and the 64-bit memory range (which is optional). Our
switch will implement both ranges. The bridge header registers for both memory
transaction types will need to be set up by the system. The memory TLP in PCI Express
does not specify which of the ranges the transaction will fit into (the SAC and DAC
address cycles don't exist like they do in PCI), so both the memory range and the
prefetchable memory range must be compared against for a 32-bit transaction since the
prefetchable range can be below 4GB. A memory transaction is defined as follows:



TLP Type   Fmt[1:0]   Type[4:0]   Description
MRd        00, 01     0 0000      Memory read
MRdLk      00, 01     0 0001      Memory read-locked
MWr        10, 11     0 0000      Memory write


Table 2.2-1: 

The LSB of the Fmt[1:0] field in each of the three memory request types is high if a
64-bit address is present and low for a 32-bit address.

Each memory request (for decode purposes of the bridge) is addressing 1 MB of memory,
so the lower 20 bits are assumed to be zero to match the memory space to the base/limit
range. Note that any memory transaction that is addressing less than 4 GB is always a
32-bit transaction. Only transactions above 4 GB may use the 64-bit address mode TLP, and
these are defined as prefetchable transactions that use the prefetchable base/limit
registers. Following is a diagram that shows one way the system could provision the two
memory address ranges.



[Figure: primary bus and secondary bus address maps, with the 4GB boundary marked]




Figure 2.2-5: 

In this example, the address 0x40_0000 going from north to south would still get through.
Also, the address [illegible]0_0000_0000 would go through from north to south. An address
of the form 0xFFF[illegible]_FFFF_FFFF from south to north would also pass across the
bridge in this configuration.



The bottom line is that for 32-bit memory transactions, the standard memory range, the
prefetchable range, and the memory-mapped BAR space must all be examined to process a
transaction. For 64-bit transactions, only the prefetchable range must be examined.
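
The decode rule above can be sketched as follows. This is a minimal illustration, not the actual lookup logic: MemWindow and downstream_memory_hit are hypothetical names, the memory-mapped BAR check is left out, and treating a window whose limit is below its base as disabled is an assumption carried over from standard bridge behavior.

```python
from dataclasses import dataclass

MB_SHIFT = 20  # ranges are 1 MB granular, so the low 20 address bits are ignored


@dataclass
class MemWindow:
    base: int   # first byte address covered by the window
    limit: int  # last byte address covered by the window (inclusive)

    def contains(self, addr: int) -> bool:
        # Assumption: a limit below the base means the window is disabled.
        if self.limit < self.base:
            return False
        return (self.base >> MB_SHIFT) <= (addr >> MB_SHIFT) <= (self.limit >> MB_SHIFT)


def downstream_memory_hit(addr: int, is_64bit: bool,
                          mem: MemWindow, pref: MemWindow) -> bool:
    """True if a memory request seen on the primary side falls in a window.

    64-bit requests are checked against the prefetchable window only;
    32-bit requests are checked against both windows. The memory-mapped
    BAR space mentioned in the text is not modeled here.
    """
    if is_64bit:
        return pref.contains(addr)
    return mem.contains(addr) or pref.contains(addr)
```

With a memory window of 0x40_0000 to 0x7F_FFFF and a prefetchable window above 4 GB, a 32-bit request to 0x40_0000 hits while one to 0x10_0000 does not.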

2.2.4.1.2 I/O transactions 

I/O transactions are limited to a 32-bit space, with a base and a limit being specified
within the P2P bridge header just like memory transactions. The following fields are
used to define the I/O TLP:

TLP Type   Fmt[1:0]   Type[4:0]   Description
IORd       00         0 0010      I/O read request
IOWr       10         0 0010      I/O write request



Table 2.2-2: 



This type of access is 4KB-aligned (lower 12 bits are assumed to be zero for matching the
range), so only the upper 20 bits of an I/O transfer are compared for a match in the
address range. For transactions traveling downstream, the I/O address must match within
the I/O range, and the transaction will be forwarded downstream. For transactions
traveling upstream, the address must match outside the range, and the transaction will be
forwarded upstream. The following diagram shows how a sample I/O address base/limit
pair can be set up, using a base of 0x20_0000 and a limit of 0x7F_FFFF.



[Figure: primary bus and secondary bus, with the I/O address space running from 0x0 to
0xFFFF_FFFF]



Figure 2.2-6: 

For example, using this configuration, a transaction for address 0x40_0000 is presented
on the bridge's primary bus interface. Since it matches within the range programmed in
the I/O base/limit registers, the transaction is forwarded downstream to the bridge's
secondary bus. Next, a transaction for address 0x0 is presented on the bridge's secondary
bus. Since this is NOT within the base/limit range, this transaction is forwarded up to the
bridge's primary bus. In order to disable this range, the limit must be programmed to a
smaller number than the base. This means all downstream transactions will be rejected,
and all upstream transactions will be allowed through. In the event of an error (an
address of 0x10_0000 is presented at the primary bus for example), the transaction is
discarded and an Unsupported Request (UR) message must be sent back out the port.
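
The I/O forwarding decision described above can be written out as a small sketch. The function names and the returned strings are hypothetical; the behavior for an upstream request that falls inside the range is left as "not_forwarded_upstream" here because it is covered separately under peer-to-peer support.

```python
IO_SHIFT = 12  # I/O ranges are 4 KB granular, so the low 12 bits are ignored


def io_in_range(addr: int, base: int, limit: int) -> bool:
    """Compare the upper 20 bits of a 32-bit I/O address to base/limit."""
    if limit < base:
        return False  # limit below base disables the range
    return (base >> IO_SHIFT) <= (addr >> IO_SHIFT) <= (limit >> IO_SHIFT)


def route_io(addr: int, from_primary: bool, base: int, limit: int) -> str:
    """Return the action a single P2P bridge takes for one I/O request."""
    inside = io_in_range(addr, base, limit)
    if from_primary:
        # Downstream: forward only when the address is inside the range.
        return "forward_downstream" if inside else "unsupported_request"
    # Upstream: forward only when the address is outside the range.
    return "forward_upstream" if not inside else "not_forwarded_upstream"
```

Using the sample base of 0x20_0000 and limit of 0x7F_FFFF, route_io(0x40_0000, True, ...) forwards downstream, route_io(0x0, False, ...) forwards upstream, and route_io(0x10_0000, True, ...) yields an Unsupported Request.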



2.2.4.1.3 Peer to Peer support 

Upstream traffic in a bridge is only allowed to cross the bridge if it's outside the 
{base,limit} range as described above in the examples. This would mean that if a 
downstream port initiates a read or write request, the address would need to fall outside 
the ranges of that port's P2P bridge. 



Assuming it's outside the range, the address is then compared with the root port's address
range and the other endpoint's address ranges. For the root's range, the address must
again be outside the {base,limit} tuple. If it is, the transaction is forwarded to the root.
If the address is within the {base,limit} range, the other endpoint's ranges are checked for
downstream matches. If a match is found, the transaction is forwarded to the other
endpoint. If no match is found, the bridge issues a UR packet back to the endpoint.

The same mechanism will be used for downstream peer-to-peer support as well. The
Root will be compared first since it is the likely recipient, but if no match is found the
other downstream ports will be checked to see if the transaction should be routed there
instead.

So, our architecture will be able to support IPC (inter-processor communication) and
downstream peer-to-peer if the EEPROM provisions our switch such that two roots can
appear on the same hierarchy. Our address_lookup_module logic does not preclude this
from happening. We have a bit defined per base/limit pair that allows for a very
flexible mapping. This bit will define each port as either upstream or downstream for
each OSD. This allows the address routing logic to map addresses into the
ranges and allows for upstream peer-to-peer and downstream peer-to-peer in the same
architecture. It is configurable by the EEPROM as to which ports "appear" on each
OSD during device discovery. Note that only the trusted software entity (either a driver
running on an OSD or the I2C interface software) is allowed to modify the routing bits,
so OSDs will not be able to make other OSDs "appear" on their PCI hierarchy by
accident. Any peer to peer configuration is application specific on how it handles the
address mapping.
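
The upstream peer-to-peer check sequence can be sketched as below. This is only an illustration of the ordering described above: Window, route_upstream_request, and the UR marker are hypothetical names, ranges are compared at byte granularity for brevity, and returning UR when the address falls inside the ingress port's own range is an assumption, since the text does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Dict, Union

UR = "UR"  # stands in for an Unsupported Request response


@dataclass
class Window:
    base: int
    limit: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr <= self.limit


def route_upstream_request(addr: int, ingress_port: int, root_port: int,
                           root_window: Window,
                           port_windows: Dict[int, Window]) -> Union[int, str]:
    """Pick the egress port for a request initiated by a downstream port."""
    # The address must fall outside the ingress port's own bridge range.
    if port_windows[ingress_port].contains(addr):
        return UR  # assumption: not specified in the text
    # Outside the root's {base,limit} tuple: forward to the root.
    if not root_window.contains(addr):
        return root_port
    # Inside the root's range: look for a match on another downstream port.
    for port, window in port_windows.items():
        if port != ingress_port and window.contains(addr):
            return port
    return UR
```

In this sketch the root is checked before the peer ports, matching the order given for the upstream case.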
2.2.5 ID routing 

ID routing is used for configuration request TLPs, completion TLPs, and optionally 
vendor-defined messages. The TLP header shown is only 12 bytes, but some ID routed 
packets can have a 16 byte header depending on the TLP type. 



[Figure: 12-byte TLP header drawn as bytes +0 through +3 per row. Legible fields in the
Byte 0 row: R, TC, TD, EP, Attr, R, Length. Legible field in the Byte 8 row: Function
Number. The remaining bytes of the Byte 4 and Byte 8 rows depend on TLP type.]



Figure 2.2-7: Header format for ID-based Transactions 






For this type of transaction, the bus_number[7:0] field is used to determine which port a
TLP needs to be sent to. If the bus_number field indicates one of our switch's VP2P
bridge headers is being addressed, the COP will use the device number field to determine
which header is being addressed and take the appropriate action.



The header fields are shown in the table below for these transaction types.

TLP Type   Fmt[1:0]   Type[4:0]   Description
CfgRd0     00         0 0100      Configuration Read Type 0
CfgWr0     10         0 0100      Configuration Write Type 0
CfgRd1     00         0 0101      Configuration Read Type 1
CfgWr1     10         0 0101      Configuration Write Type 1
Msg        01         1 0010      Message Request - Routed by ID
MsgD       11         1 0010      Message Request with data payload - routed by ID
Cpl        00         0 1010      Completion without Data - Used for I/O and
                                  Configuration Write Completions and Read
                                  Completions (I/O, Configuration, or Memory) with
                                  Completion Status other than Successful Completion
CplD       10         0 1010      Completion with Data - Used for Memory, I/O, and
                                  Configuration Read Completions.
CplLk      00         0 1011      Completion for Locked Memory Read without Data -
                                  Used only in error case.
CplDLk     10         0 1011      Completion for Locked Memory Read - otherwise
                                  like CplD.



Table 2.2-3: 

Note that configuration requests will always be sent to the COP for processing. The 
function_number fields shouldn't need to be used by the switch since we won't support 
any multi-function devices. 

For messages and completions, the bus number is compared against the programmed 
secondary and subordinate bus numbers for the port's P2P bridge header. An example is 
shown below from a software topology point of view. 



NextIO, Inc. Confidential 
Property of NextIO, Inc. 



Page 9 of 222 



NextIO, Inc, 

© All Rights Reserved. 



NEXSIS Overview Document 
V0.8 



[Figure: Root Complex and switch port with four virtual P2P bridges, the COP, Port 3
(Endpoint), and Port 4 (PCI-Express/PCI bridge and Endpoint) on buses 2 through 6.
Bridge bus numbers are shown as {primary, secondary, subordinate}: {0x2, 0x3, 0x7},
{0x3, 0x4, 0x4}, {0x3, 0x5, 0x5}, and {0x3, 0x6, 0x7}.]



Figure 2.2-8: 

Example 1 

A configuration packet arrives on port 1. 
• If the packet is a type 0 configuration packet, the Address_lookup_module
immediately returns port 16 as the destination since this packet will be handled by the
COP.

• If the packet is a type 1 configuration packet, the Bus Number field of the header is
compared with Port 1's P2P bridge secondary and subordinate bus numbers. If (bus
number < 3) or (bus number > 7), the transaction is dropped and a UR transaction is
signaled back to the requester since it isn't in the range of the
{secondary,subordinate} tuple.

• Else if the Bus Number field is equal to 3 (the secondary bus number of the P2P
bridge header), the packet is addressing one of the downstream virtual P2P bridges
inside the switch. Again, the Address_lookup_module returns port 16, and the packet
is sent to the COP.

• Else, the {secondary,subordinate} tuples of all 16 destination ports are compared to
see which port this packet should be routed to (i.e., secondary_bus[port 3] <=
packet_bus_number <= subordinate_bus[port 3], etc.). Once a match is found, the
destination port is returned. Note this could still return port 16, which would mean the
BIOS is addressing the COP's type 0 header space.
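
The decision sequence of Example 1 can be sketched as a single function. The function name, the use of None to represent "dropped with UR", and the port-to-bus-range mapping in the usage note are hypothetical; the behavior when no destination port matches is an assumption, since the example does not cover it.

```python
from typing import Dict, Optional, Tuple

COP_PORT = 16  # ports 0 through 15 are data ports; port 16 is the COP


def route_config(cfg_type: int, bus: int,
                 ingress_secondary: int, ingress_subordinate: int,
                 port_bus_ranges: Dict[int, Tuple[int, int]]) -> Optional[int]:
    """Return the destination port for a configuration request.

    port_bus_ranges maps a destination port to its (secondary, subordinate)
    bus numbers. None means the request is dropped and a UR is signaled.
    """
    if cfg_type == 0:
        return COP_PORT  # type 0 requests are always handled by the COP
    if not (ingress_secondary <= bus <= ingress_subordinate):
        return None  # outside the {secondary,subordinate} tuple
    if bus == ingress_secondary:
        return COP_PORT  # addressing a downstream virtual P2P bridge header
    for port, (secondary, subordinate) in port_bus_ranges.items():
        if secondary <= bus <= subordinate:
            return port
    return None  # assumption: no matching port is treated like a UR
```

With an ingress bridge of {secondary 3, subordinate 7} and an assumed mapping of {16: (4, 4), 3: (5, 5), 4: (6, 7)}, a type 1 request for bus 5 goes to port 3 and one for bus 4 goes to the COP.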






2.2.6 Implicit routing 

Message transactions are the only type that can use implicit routing, and the port logic
should always examine the r[2:0] sub-field for message packets to see how to handle
them. The following table shows the decoding of the various values.



r[2:0]    | Definition
000       | Routed to Root Complex
001       | Routed by Address
010       | Routed by ID
011       | Broadcast from Root Complex
100       | Local - Terminate at Receiver
101       | Gathered and Routed to Root Complex (see section 5.3.3.2.1 of Base Spec)
110-111   | Reserved - Terminate at Receiver



Table 2.2-4: 

If a message transaction's value of r[2:0] is either 001 (address routing) or 010 (ID
routing), the port logic will follow the same sequence of events listed above in sections
2.2.4 and 2.2.5. If the value is one of the other values, the route is "implicit" since the
definition of the encoding tells the port logic what to do.
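
The r[2:0] decode in Table 2.2-4 can be expressed as a small lookup. The enum and function names are hypothetical and only mirror the table; this is a sketch of the classification, not the port logic itself.

```python
from enum import Enum


class MsgRouting(Enum):
    TO_ROOT_COMPLEX = "routed to root complex"
    BY_ADDRESS = "routed by address"
    BY_ID = "routed by ID"
    BROADCAST_FROM_ROOT = "broadcast from root complex"
    LOCAL_TERMINATE = "local - terminate at receiver"
    GATHERED_TO_ROOT = "gathered and routed to root complex"
    RESERVED_TERMINATE = "reserved - terminate at receiver"


# One entry per r[2:0] encoding, following Table 2.2-4.
_R_DECODE = {
    0b000: MsgRouting.TO_ROOT_COMPLEX,
    0b001: MsgRouting.BY_ADDRESS,
    0b010: MsgRouting.BY_ID,
    0b011: MsgRouting.BROADCAST_FROM_ROOT,
    0b100: MsgRouting.LOCAL_TERMINATE,
    0b101: MsgRouting.GATHERED_TO_ROOT,
    0b110: MsgRouting.RESERVED_TERMINATE,
    0b111: MsgRouting.RESERVED_TERMINATE,
}


def decode_msg_routing(type_field: int) -> MsgRouting:
    """Decode the r[2:0] sub-field from the low 3 bits of Type[4:0]."""
    return _R_DECODE[type_field & 0b111]


def is_implicit(routing: MsgRouting) -> bool:
    """Everything except address and ID routing is an implicit route."""
    return routing not in (MsgRouting.BY_ADDRESS, MsgRouting.BY_ID)
```

For a message with Type[4:0] = 1 0011, decode_msg_routing returns BROADCAST_FROM_ROOT and is_implicit is true.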




2.2.6.1.1 Routed to Root Complex

Our chip will have a root complex field assigned to each input port based on the source
OSD to handle this type of packet. The Address_lookup_module will simply index the
root table (by OSD number) and return the root port number to the MAC.

2.2.6.1.2 Broadcast from Root Complex

This only applies to two types of TLPs, PME_Turn_Off and Unlock. Unlock will not
actually be broadcast (which is legal according to the spec) since our switch will track
which port has been locked and unlock the appropriate queues. PME_Turn_Off will be
handled by the address_lookup_module logic and the COP. The lookup logic [remainder
illegible]

2.2.6.1.3 Local - Terminate at Receiver

This type of packet will always be returned from the Address_lookup_module as routed
to port 16 for the COP to handle (ports 0 through 15 are the actual data ports).



2.2.6.1.4 Gathered and Routed to Root Complex

This type of routing is only used whenever the switch receives a PME_TO_Ack
message from downstream ports. The Address_lookup_module will again return port
16. The COP will scoreboard the responses from all downstream ports. Once the COP
has received a PME_TO_Ack packet from each downstream port, it then returns a single
PME_TO_Ack packet back to the root complex and sets the r[2:0] sub-field to 3'b101 to
tell the root complex that all downstream ports in the switch responded to the
PME_Turn_Off message.

Note that a timer must also be implemented in this logic to avoid deadlock, since the
return of the PME_TO_Ack packet back to the Root should not be blocked due to one
device's failure to send a PME_TO_Ack in a reasonable amount of time (no time given
in the spec for this, but 100 ms is mentioned elsewhere as a timeout number, so maybe
we'll use that).

2.2.6.1.5 Reserved - Terminate at Receiver

These are nearly identical to the Local - Terminate at Receiver types of routing. This
transaction will return port 16 from the Address_lookup_module and will be sent to the
COP.
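
The PME_TO_Ack gathering described in 2.2.6.1.4 can be sketched as a scoreboard with a timeout. The class and method names are hypothetical, time is passed in explicitly rather than read from a clock, and the 100 ms default is only the tentative figure mentioned above, not a decided value.

```python
class PmeToAckScoreboard:
    """Tracks PME_TO_Ack responses from downstream ports (illustrative)."""

    def __init__(self, downstream_ports, timeout_s=0.1):
        self.pending = set(downstream_ports)  # ports that have not acked yet
        self.timeout_s = timeout_s            # tentative 100 ms timeout
        self.started_at = None
        self.sent = False

    def start(self, now: float) -> None:
        """Record when PME_Turn_Off was sent to the downstream ports."""
        self.started_at = now

    def ack(self, port: int) -> None:
        """Record a PME_TO_Ack received from one downstream port."""
        self.pending.discard(port)

    def should_send_ack(self, now: float) -> bool:
        """True exactly once: when all ports acked or the timer expired."""
        if self.sent or self.started_at is None:
            return False
        timed_out = (now - self.started_at) >= self.timeout_s
        if not self.pending or timed_out:
            self.sent = True
            return True
        return False
```

The single aggregated PME_TO_Ack toward the root complex would be generated at the point where should_send_ack first returns True.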



2.2.7 Isolation of OS Domains 

As TLPs are received on a Rx MAC, they are queued into a structure that is separated by
OS domain. PCI-EX ordering rules are always adhered to within an OS domain but
never across OS domains to avoid head-of-line blocking conditions. Flow control is also
isolated by OS domain and VC. The PCI-EX base specification defines flow control on a
per VC basis. The switch is configured to allow assignment of VC resources per OS
domain to enable absolute flow control (i.e. flow control per VC per domain).
Therefore, from the Rx MAC perspective, each OS domain is allocated its own buffer,
queue, and flow control resources.

In the Switch Core, all transactions are arbitrated using a round-robin algorithm,
where fairness is enforced at 3 different levels - [illegible], VC arbitration, and OSD
arbitration.

The COP manages the different OS domains by appearing as a Type 0 "shared" I/O
device. After switch configuration is complete, each Root Port is assigned to a
particular OS domain, and for each Root Port bus, what ports are targets on that bus and
their corresponding port/OSD. Any Root Port has no knowledge of the presence of
any other Root Port. If a Root Port is using a shared I/O port, it has no knowledge of
the other OS domains that can access the shared I/O port. Therefore, at the COP level,
OS domains are completely isolated.

One exception to this rule is a scenario where the console management software is
running on a trusted blade server. In that case, the OSD is allowed to manage the
device-specific registers by accessing them through the COP's type 0 header space. To
become a trusted blade server, the driver on that OSD must present a key to the switch.
If that key matches the trusted key, the COP sets a bit that tells that OSD it can now
access our device-specific registers.

2.3 Data Flow Examples 
2.3.1 Address-based Requests 

All Memory transactions (Reads, Writes, and Completions) and I/O transactions (Reads, 
Writes, and Completions) are address-based. The following describes the data flow of an 
address-based TLP: 

1. The initiator generates a Posted Request.

2. The Rx MAC receives the Posted Request from the SERDES.

3. The Rx Physical Layer Module (RxPHY) performs 8b/10b decoding, de-scrambles
the data stream, and performs clock compensation between the extracted clock and
the core clock. The data stream is in the form of 8 bits per clock (at 250 MHz). It
will also perform lane-to-lane deskew if lane width is greater than one.

4. The Rx Data Link Layer Module (RxDLLM) converts the data from the RxPHY
to a 64-bit data path at 250MHz regardless of the link width. For example, if the
link width is x1, it will take 8 clks to gather 64 bits at 250 MHz. The Rx DLLM
then checks the Sequence Number and LCRC as the TLP is passing through to the
Transaction Layer. The Rx DLLM also removes the Sequence Number and the
LCRC from the TLP before it forwards the packet to the Rx Transaction Layer
Module (TLM).

5. The RxTLM performs TC to VC mapping and OS domain identification to
determine if there are enough header and data resources for the particular TLP. If
there are, the TLP is stored in the header buffer and the data buffer (if the TLP
contains payload) and the starting address of the Header is forwarded to the
Address Lookup FIFO. All flow control calculations are implemented in the Rx
TLM and are scheduled to transmit every time a TLP is stored or a credit is
de-allocated.

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (in this case, the 32-
bit or 64-bit address) from the Header Buffer. The following information is
passed into the Lookup Module:

• address[43:0] - contains the upper 44 bits of a 64-bit memory
transaction, the upper 12 bits of a 32-bit memory transaction, or the upper
20 bits of an I/O transaction.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The types relevant to this transaction type are
marked with an asterisk (*):

  Lookup type[2:0]   Transaction definition
* 3'b000             32-bit memory transaction
* 3'b001             64-bit memory transaction
  3'b010             ID-based transaction
* 3'b011             32-bit I/O transaction
  3'b100             Routed to root complex
  3'b101             Broadcast from root complex
  3'b110             Terminate at receiver
  3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately
presented to the Address Lookup Module in the Switch Core.

7. The Address Lookup Module (ALM) figures out the root port and which ports are
connected to this ingress port. Then it begins walking the entries (up to 16) to
find the base/limit pair that matches the address range, based on the array of valid
ports to search. Note that the root entry is searched first if the
'port_is_downstream' bit is set. The ALM also determines the egress QID
number (ranging from 0 to 15).

8. The ALM returns the egress_qid[3:0] and destination_port[4:0] to the Address
Lookup Interface in the RxTLM. The transaction will not be submitted to the
Transaction Queues until there is enough data, or until the entire packet is stored
when the egress_cut_through_enbl is not set.

9. When the Address Lookup Module returns the egress_qid[3:0] and
destination_port[4:0], the Address Lookup Interface stores the information in a 32
deep FIFO. This ALM Response FIFO is necessary in case the Transaction
Queues are not able to accept the transactions as fast as the Address Lookup
Module Interface is able to submit them. When this FIFO becomes half full, it
backpressures the Address Lookup Module Interface. This means that the
Address Lookup Module Interface will not issue any more requests to the Address
Lookup Module until the ALM Response FIFO is less than half full.

10. The state of each transaction queue (empty or non-empty) in the Rx PM is sent to
the Switch Core. The Switch Core will query the Presentation Module asking for
the next transaction in a particular queue. The Presentation Module will resolve
the transaction ordering and return a packet header to the Switch Core. The
Switch Core will check flow control credits and will queue the transaction unless
it does not pass the flow control check. The Switch Core will use the query
information to eventually request the Presentation Module to transmit a packet.

11. The Presentation Layer will read the packet information and pop the packet out of
the Transaction Queue. It will pass the packet information to the packet
scheduler.

12. The packet scheduler will take the packet information, and create a TLP by
reading the original TLP header and appending the TLP data. This packet is sent to
the Data Mover.

13. Once the Switch Data Mover starts transferring the data, the TLP is stored in the
Tx TLM Mini-FIFO 64 bits at a time. As the TLP is read from the Mini-FIFO,
the Shared IO Header is inserted before the TLP Base Header (if the endpoint is a
shared I/O). This is fed into the Tx Data Link Layer Module (DLLM).

14. The Tx DLLM receives the TLP from the Mini-FIFO and starts calculating the
LCRC along with appending the Sequence Number to the start of the TLP.

15. The TLP is forwarded to the Tx Physical Layer Module (PLM) where it is
scrambled and encoded from 8 bits to 10 bits and sent out on the wire.
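
The address[43:0] field passed to the Lookup Module in step 6 can be illustrated with a short sketch. The constant and function names are hypothetical; the lookup_type encodings come from the table in step 6, and placing the extracted bits right-justified within address[43:0] is an assumption, since the text does not state the alignment.

```python
# lookup_type[2:0] encodings for the types that carry an address or bus number
LOOKUP_MEM32 = 0b000
LOOKUP_MEM64 = 0b001
LOOKUP_ID = 0b010
LOOKUP_IO32 = 0b011


def lookup_address(lookup_type: int, value: int) -> int:
    """Build the address[43:0] lookup key for one transaction.

    value is the full memory or I/O address, or the bus number for an
    ID-based transaction. Right-justifying the result is an assumption.
    """
    if lookup_type == LOOKUP_MEM64:
        return (value >> 20) & ((1 << 44) - 1)  # upper 44 bits of 64
    if lookup_type == LOOKUP_MEM32:
        return (value >> 20) & 0xFFF            # upper 12 bits of 32
    if lookup_type == LOOKUP_IO32:
        return (value >> 12) & 0xFFFFF          # upper 20 bits of 32
    if lookup_type == LOOKUP_ID:
        return value & 0xFF                     # 8-bit bus number
    raise ValueError("lookup type carries no address")
```

The shift amounts follow from the 1 MB memory granularity and the 4 KB I/O granularity described in section 2.2.4.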

2.3.2 Configuration Requests and Completions

Configuration Requests and Completions follow the same steps as in the previous section
except that they are ID-based transactions. Steps 6 and 7 would be replaced with:

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (in this case, the bus
number) from the Header Buffer. The following information is passed into the
Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The types relevant to this transaction type are
marked with an asterisk (*):

  Lookup type[2:0]   Transaction definition
  3'b000             32-bit memory transaction
  3'b001             64-bit memory transaction
* 3'b010             ID-based transaction
  3'b011             32-bit I/O transaction
  3'b100             Routed to root complex
  3'b101             Broadcast from root complex
  3'b110             Terminate at receiver
  3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately
presented to the Address Lookup Module in the Switch Core.

7. The Address Lookup Module (ALM) figures out the root port and which ports are
connected to this ingress port. Then it begins walking the entries (up to 16) in the
bus number lookup table to find the base/limit pair that matches the address
range, based on the array of valid ports to search. Note that the root entry is
searched first if the 'port_is_downstream' bit is set. The ALM also determines
the egress QID number (ranging from 0 to 15). If the ALM determines that a
particular configuration request is targeted for the COP, it will notify the Address
Lookup Interface that the TLP configuration type should be changed from type 1
to type 0 when the TLP is forwarded to the Data Mover.

2.3.3 Message Requests 

Each type of Message Request behaves differently regarding data flow. The following 
sections describe the various types of Message Requests and how it differs from the data 
flow example in Section 2.3.1. 
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2.33.1.1 INTx Interrupt Signaling Message Requests 

The INTx virtual wire interrupt signaling mechanism is used to support legacy Endpoints 
in cases where the Message Signaled Interrupt mechanism cannot be used. All INTx 
messages are routed to the Root Complex. 

INTx Messages follow the same steps in the previous section except for steps 6 and 7 
would be replaced with: 

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (in this case, the bus
number) from the Header Buffer. The following information is passed to the
Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The type relevant to this transaction type is
marked with an asterisk:

Lookup type[2:0] | Transaction definition
3'b000 | 32-bit memory transaction
3'b001 | 64-bit memory transaction
3'b010 | ID-based transaction
3'b011 | 32-bit I/O transaction
3'b100 | Routed to root complex *
3'b101 | Broadcast from root complex
3'b110 | Terminate at receiver
3'b111 | Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately
presented to the Address Lookup Module in the Switch Core.

7. In this case, the Address Lookup Module (ALM) knows that the TLP is going to a
Root Complex, so it only has to figure out the root port. Then it uses the device
number to determine the mapping of the INTx virtual wire on the primary side of
the bridge. The ALM returns int_map[1:0] to the Address Lookup Interface, which
stores it in the Header Info Code in the Header Buffer so that the Packet
Generator will know to overwrite the Code in the TLP with the Code provided in
the Header Info Code.
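
The text does not spell out how the device number selects the mapped virtual wire. As an illustration only, the sketch below assumes the conventional PCI-to-PCI bridge INTx mapping, in which the wire seen on the primary side is the secondary-side wire rotated by the device number modulo 4. The function name and the 0 to 3 encoding of INTA to INTD are assumptions, not names taken from this design.

```python
# Hypothetical sketch: conventional bridge INTx mapping (assumed, not
# confirmed by this document). Wires are encoded 0..3 for INTA..INTD.
INTX_NAMES = ("INTA", "INTB", "INTC", "INTD")


def map_intx(device_number: int, intx: int) -> int:
    """Return the 2-bit virtual wire code seen on the primary side."""
    if not 0 <= intx <= 3:
        raise ValueError("intx must be 0..3 (INTA..INTD)")
    if device_number < 0:
        raise ValueError("device_number must be non-negative")
    # Rotate the wire by the device number, wrapping at four wires.
    return (intx + device_number) % 4


if __name__ == "__main__":
    for dev in range(4):
        mapped = [INTX_NAMES[map_intx(dev, w)] for w in range(4)]
        print(f"device {dev}: {mapped}")
```

The 2-bit result would play the role of int_map[1:0] in the step above, if the design follows the conventional mapping.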

2.3.3.1.2 Power Management Message Requests

There are two Power Management Messages that require special handling in the Address 
Lookup Interface in the Rx TLM. They are PME_Turn_Off and PME_TO_Ack. 



2.3.3.1.3 PME_Turn_Off

PME_Turn_Off is generated by a root complex to notify all of its downstream ports to
prepare for power removal. The PME_Turn_Off Message has a routing type of 3'b101, which
is 'broadcast from root complex'. In this case, steps 6 and 7 are replaced by the
following:

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (in this case, the bus
number) from the Header Buffer. The following information is passed into the
Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The type relevant to this transaction type is
marked with an asterisk:

Lookup type[2:0] | Transaction definition
3'b000 | 32-bit memory transaction
3'b001 | 64-bit memory transaction
3'b010 | ID-based transaction
3'b011 | 32-bit I/O transaction
3'b100 | Routed to root complex
3'b101 | Broadcast from root complex *
3'b110 | Terminate at receiver
3'b111 | Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately
presented to the Address Lookup Module in the Switch Core.

7. In this case, the Address Lookup Module (ALM) knows that the TLP needs to be
broadcast to all downstream ports configured to the Root Complex, so it looks up
the endpoints that the TLP should be sent to and asserts the corresponding bits in
broadcast_ports[15:0]. The Address Lookup Interface then submits a TLP to the
Transaction Scheduler for each downstream port designated in
broadcast_ports[15:0], and also sends it to the COP (port 16). The Address
Lookup Interface will then halt all forward progress until it has been notified by
the COP that the scoreboard is set up and it is ready to receive the
PME_TO_Ack Messages from each downstream port.



2.3.3.1.4 PME_TO_Ack 

The only thing to note for PME_TO_Ack Messages is that whenever a port sends a
PME_TO_Ack Message upstream, it must be routed to the COP. The COP will keep a
scoreboard of all the endpoint ports that received a PME_Turn_Off Message and mark
them off as they transmit a PME_TO_Ack Message. When all of the endpoint ports have
transmitted a PME_TO_Ack Message (or the timer expires), the COP will generate one
PME_TO_Ack Message to the Root Complex.
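
The scoreboard behavior described above can be sketched as a bitmask that starts as a copy of broadcast_ports[15:0] and is cleared one bit at a time as acknowledgments arrive. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, and the timer is modeled only as an explicit call, since the text does not give its duration.

```python
# Hypothetical model of the COP scoreboard for PME_Turn_Off / PME_TO_Ack.
class PmeScoreboard:
    def __init__(self, broadcast_ports: int):
        # One bit per downstream port that was sent PME_Turn_Off.
        self.pending = broadcast_ports & 0xFFFF

    def record_ack(self, port: int) -> bool:
        """Clear the port's bit; return True once every port has acknowledged."""
        if not 0 <= port <= 15:
            raise ValueError("port must be 0..15")
        self.pending &= ~(1 << port) & 0xFFFF
        return self.pending == 0

    def timer_expired(self) -> bool:
        """Give up on outstanding ports; the single upstream Ack may be sent."""
        self.pending = 0
        return True


if __name__ == "__main__":
    sb = PmeScoreboard(0b0000_0000_1101_0000)  # ports 4, 6 and 7
    for port in (4, 6, 7):
        done = sb.record_ack(port)
        print(f"ack from port {port}: all acknowledged = {done}")
```

Only when record_ack (or timer_expired) returns True would the COP generate the single PME_TO_Ack Message toward the Root Complex.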

2.3.3.1.5 Locked Transaction Message Requests

Whenever a particular root requests a locked transaction, all other sources going to that
output will be halted. When the CplLk is received from the downstream port, all other
upstream queues going to the root are locked until the Unlock message is received.





2.4 Shared Link Descriptions 
2.4.1 AS Encapsulation 

AS header encapsulation only pertains to Shared Ports. Non-shared Ports have no
knowledge of "Shared I/O". On the Rx MAC, the Transaction Layer Module strips the
AS Header from all TLPs before it stores the packet in the in-line buffer. It uses the AS
Header to determine the OS Domain associated with the given TLP. On the Tx MAC,
the Transaction Layer Module inserts the AS Header into the given TLP. The OS Domain
that is inserted in the AS Header is reported by the Switch Core and passed on to
the Tx Transaction Layer Module.

The AS Header is described in detail in the following section. Figure 2.4-2 describes the
format of the header.

2.4.2 OS Domain Routing

For our switch, ports can be shareable ports, which means multiple different CPUs can 
address resources over the same PCI-Express link. A maximum of 16 OS Domains (or 
CPUs) will be supported in this implementation, with each port having the capability to 
send and receive from 16 OS Domains, across all 8 VCs possible in the PCI Express 
spec. 

The PCI-Express Advanced Switching (AS) spec incorporates an 8-byte AS header that is
inserted into the transaction. Our switch will use this header to specify the OS Domain
associated with a given transaction. The following diagram shows where this header is
located in the packet.

[Figure 2.4-1: location of the AS header in the packet. Diagram not reproduced; layers
shown: Transaction Layer, Data Layer, Physical Layer.]

AS headers specify an 8-bit field, called the PI (Protocol Interface) field, that indicates
what type of AS packet is contained in the payload. We're going to use a number
ranging from 224 through 254 as a vendor-defined PI. The only other piece of
information in our AS header will be the 8-bit OS Domain field. The proposed AS header
is shown below.



[Figure 2.4-2: proposed AS header. Diagram not reproduced; the header is drawn as two
rows (Byte 0 and Byte 4) of four byte columns (+0 to +3), with fields R, RNP, RN, OSD,
PI and R.]

PI - Protocol Identifier
OSD - OS Domain number
RNP - Resource Number Present (when high, the RN field is valid; when low, it is
invalid and must be zeros)
RN - Resource Number (which buffer this packet belongs to)
R - reserved




The RX MAC will first check the PI field to make sure this type of AS
packet is understood by our switch. We'll have a register within the RX MAC defined
that contains the allowable encoding for this type of AS packet. At first, this will likely
be our selected vendor-defined PI number. If our technology is adopted by the SIG, a 
standard number will be defined, and this register will then be programmed to contain 
this standard value for use with future devices. 



This AS header allows our switch to map the incoming value of the OS Domain field 
(local to the I/O device or inter-switch port) in the AS header to the global OS Domain 
number within the switch (one of 16 values, 0 through 15 for the first revision of our 
switch). Any packets received on the shared port that use a different PI type will be 
discarded. 

For upstream ports, the RX MAC will only need to associate the OSD number with a
4-bit register number loaded at config time. This number will be the OSD field that is
passed to the lookup unit to match the correct set of P2P bridge headers to compare for a
lookup match.
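
The PI check and OS Domain mapping described in this section can be sketched as a small filter. This is only an illustration: the class name, the dictionary used as a mapping table, and the choice to raise an error for an unmapped local OSD value are assumptions, since the text does not say how an unmapped value is handled.

```python
# Hypothetical model of the RX MAC handling of the AS header on a shared port.
from typing import Dict, Optional


class RxAsHeaderFilter:
    def __init__(self, allowed_pi: int, osd_map: Dict[int, int]):
        if not 0 <= allowed_pi <= 0xFF:
            raise ValueError("PI is an 8-bit field")
        for local, global_osd in osd_map.items():
            if not 0 <= local <= 0xFF:
                raise ValueError("local OSD is an 8-bit field")
            if not 0 <= global_osd <= 15:
                raise ValueError("global OSD must be 0..15")
        self.allowed_pi = allowed_pi  # models the programmable PI register
        self.osd_map = dict(osd_map)  # local OSD -> global OSD in the switch

    def accept(self, pi: int, local_osd: int) -> Optional[int]:
        """Return the global OSD, or None if the packet is discarded."""
        if pi != self.allowed_pi:
            return None  # a different PI type is discarded
        # Assumption: an unmapped local OSD is treated as a programming error.
        return self.osd_map[local_osd]


if __name__ == "__main__":
    rx = RxAsHeaderFilter(allowed_pi=224, osd_map={0x21: 3})
    print(rx.accept(224, 0x21))
    print(rx.accept(225, 0x21))
```

Keeping the allowed PI as a constructor argument mirrors the text's point that the register can later be reprogrammed with a standard value.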

2.4.3 Flow Control and Credits 



Flow Control is used to track the queue/buffer space available in the Agent across the
link. It is used to prevent overflow of receiver buffers. The Flow Control information is
conveyed between the two sides of the Link using DLLP packets.

For Non-shared Endpoints (only one OSD), flow control updates follow the same format
as the PCI Express Base Specification, which is as follows:



[Diagram not reproduced. Fields, in order from Byte 0: P/NP/Cpl, 0, VC ID, R, HdrFC,
R, DataFC, followed by the 16b CRC.]

Figure 2.4-3: DLLP format for Flow Control Packet for Non-Shared Ports

P/NP/Cpl: This field specifies the type of transaction that is being reported. P - Posted
Request; NP - Non-posted Request; Cpl - Completion.

VC ID: This field specifies the Virtual Channel that is being reported.

R: Reserved

HdrFC: This field contains the credit value for Headers of the indicated type. One credit
value for headers is one maximum-size header plus TLP digest.

DataFC: This field contains the credit value for payload data of the indicated type. One
credit value is equivalent to 16 bytes of data.

16bCRC: This field contains the calculated CRC value of all bits of the packet using the
polynomial coefficient of 100Bh.



For Shared Endpoints, flow control updates are advertised using the following DLLP to
account for the OSD:

[Diagram not reproduced. Fields, in order from Byte 0: Type (1011, 0, V2 V1 V0), TT, R,
OSD, CT, Credit count, followed by the 16b CRC.]

Figure 2.4-4: DLLP format for Flow Control Packet for Shared Ports

• Type: Upper nibble set to 1011 for an FC Update shared-link DLLP. The lower
nibble specifies the VC number.

• TT: Transaction Type (00 for Posted, 01 for Non-posted, 10 for Completions)

• R: Reserved

• OSD: OSD number

• CT: Credit Type (0 for header credits, 1 for data credits)

• Credit count: Contains either the 12-bit data credit count or the 8-bit (upper 4 bits
are zeros) header credit count, based on the value of the CT bit

• 16bCRC: This field contains the calculated CRC value of all bits of the packet
using the polynomial coefficient of 100Bh.
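
The field rules listed above can be captured in a small record type. This is a sketch, not the DLLP encoder: exact bit positions are not recoverable from the figure, so only the Type byte (upper nibble 1011, lower nibble the VC number) is composed. The class name, the 0 to 15 OSD range (taken from the 16 OS Domains the switch supports) and the validation behavior are assumptions.

```python
# Hypothetical record for a shared-link FC Update DLLP; names are illustrative.
from dataclasses import dataclass

TT_POSTED, TT_NON_POSTED, TT_COMPLETION = 0b00, 0b01, 0b10
CT_HEADER, CT_DATA = 0, 1


@dataclass(frozen=True)
class SharedFcUpdate:
    vc: int            # Virtual Channel number, 0..7
    tt: int            # Transaction Type
    osd: int           # OS Domain number (assumed range 0..15)
    ct: int            # Credit Type: 0 = header, 1 = data
    credit_count: int  # 12-bit field; header counts use only the low 8 bits

    def __post_init__(self):
        if not 0 <= self.vc <= 7:
            raise ValueError("vc must be 0..7")
        if self.tt not in (TT_POSTED, TT_NON_POSTED, TT_COMPLETION):
            raise ValueError("tt must be 00, 01 or 10")
        if not 0 <= self.osd <= 15:
            raise ValueError("osd must be 0..15")
        if self.ct not in (CT_HEADER, CT_DATA):
            raise ValueError("ct must be 0 or 1")
        limit = 0xFFF if self.ct == CT_DATA else 0xFF
        if not 0 <= self.credit_count <= limit:
            raise ValueError("credit_count out of range for this credit type")

    def type_byte(self) -> int:
        """Upper nibble 1011, lower nibble the VC number."""
        return (0b1011 << 4) | self.vc


if __name__ == "__main__":
    upd = SharedFcUpdate(vc=3, tt=TT_POSTED, osd=5, ct=CT_HEADER, credit_count=0x40)
    print(hex(upd.type_byte()))
```

Rejecting a header credit count above 0xFF enforces the rule that the upper 4 bits of the credit count are zeros when CT selects header credits.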

2.4.3.1.1 Receiver Flow Control 

The RxTLM keeps track of flow control accounting functions of its buffers. This
information is assembled into an UpdateFC DLLP and forwarded to the Tx Transaction
Layer Module, where it is scheduled to be transmitted to the Agent across the Link.

For each type of information tracked, the following quantities are calculated for flow
control TLP Receiver accounting (for non-shared ports, these calculations are performed
for each VC, and for shared ports these calculations are performed for each VC/OSD
group):

• CREDITS_ALLOCATED - The total number of credits granted to the
Transmitter since initialization, modulo 2^[Field Size] (where [Field Size] is 8 for
headers and 12 for payload data).

• CREDITS_RECEIVED - The total number of FC units consumed by valid TLPs
received since flow control initialization, modulo 2^[Field Size] (where [Field Size] is
8 for headers and 12 for payload data).

The RxTLM will also check for receiver overflow errors. This is done by checking the
following equation:

$$(\mathrm{CREDITS\_ALLOCATED} - \mathrm{CREDITS\_RECEIVED}) \bmod 2^{[\mathrm{Field\ Size}]} \geq 2^{[\mathrm{Field\ Size}]} / 2$$
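
The overflow check relies on modular arithmetic so that it still works after the counters wrap. The sketch below evaluates the same comparison; the function name is hypothetical, and the counters are assumed to be plain integers already reduced modulo 2^[Field Size].

```python
# Hypothetical helper evaluating the receiver overflow condition.
def receiver_overflow(credits_allocated: int, credits_received: int,
                      field_size: int) -> bool:
    """True when more FC units were received than were ever allocated."""
    modulus = 1 << field_size  # 2^[Field Size]: 256 for headers, 4096 for data
    # Python's % always returns a non-negative result, which matches
    # the modulo arithmetic used in the equation.
    return (credits_allocated - credits_received) % modulus >= modulus // 2


if __name__ == "__main__":
    print(receiver_overflow(10, 8, 8))    # two credits still unused
    print(receiver_overflow(10, 11, 8))   # one credit too many was consumed
    print(receiver_overflow(5, 250, 8))   # allocated counter has wrapped
```

The third call shows why the comparison is done modulo 2^[Field Size]: an allocated count of 5 after wrapping is still 11 units ahead of a received count of 250.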
The scheduling of transmission of UpdateFC DLLPs will obey the following rules:

• If the Link is in the L0 or L0s Link state, UpdateFC DLLPs will be scheduled for
transmission once every 30us or 120us, depending on the status of the Extended
Sync bit in the Link Control Register.

• A Timer will also be implemented with the following rules:

■ The Timer is active only when the Link is in the L0 or L0s Link
state.

■ The Timer has a limit of 200us.

■ The receipt of any Init or Update FC DLLP resets the Timer.

■ Upon Timer expiration, the Physical Layer will be instructed to
retrain the Link.

Otherwise, for all types of transactions that do not have infinite credit, a Flow Control
DLLP will be scheduled for transmission after a valid TLP is received and stored, or
when one unit is made available by TLPs processed.
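
The Timer rules above can be modeled in a few lines. This is a sketch under assumptions: the class and method names are hypothetical, time is advanced by explicit calls, and the Timer is assumed to hold its value (rather than reset) while the Link is outside L0/L0s, which the text does not specify.

```python
# Hypothetical model of the flow control update Timer.
class FcUpdateTimer:
    LIMIT_US = 200  # Timer limit from the rules above

    def __init__(self):
        self.elapsed_us = 0

    def on_fc_dllp(self) -> None:
        """Receipt of any Init or Update FC DLLP resets the Timer."""
        self.elapsed_us = 0

    def advance(self, delta_us: int, link_state: str) -> bool:
        """Advance time; return True if the Link should be retrained."""
        if link_state not in ("L0", "L0s"):
            return False  # Timer is active only in L0 or L0s
        self.elapsed_us += delta_us
        return self.elapsed_us >= self.LIMIT_US


if __name__ == "__main__":
    timer = FcUpdateTimer()
    print(timer.advance(150, "L0"))
    timer.on_fc_dllp()
    print(timer.advance(199, "L0"))
    print(timer.advance(1, "L0s"))
```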

2.4.3.1.2 Transmitter Flow Control

The Transaction Scheduler receives the latest FC updates from all the Tx MACs. These
FC updates report the most recent number of FC units advertised by the receiver on the
other side of the link, called CREDIT_LIMIT. CREDIT_LIMIT is used to determine if
a transaction being transferred to a particular Tx MAC has enough FC credits to be
forwarded to that Tx MAC.

For each type of information tracked, the following quantities are calculated for flow 
control gating: 



• CREDITS_CONSUMED - The total number of FC units consumed by TLP
transmissions made since flow control initialization, modulo 2^[Field Size] (where
[Field Size] is 8 for headers and 12 for payload data).

• CREDIT_LIMIT - The most recent number of FC units legally advertised by the
Receiver. This quantity represents the total number of FC credits made available
by the Receiver since flow control initialization, modulo 2^[Field Size] (where [Field
Size] is 8 for headers and 12 for payload data).

To determine if there is enough credit for the current TLP, the following equation
is evaluated:

$$(\mathrm{CREDIT\_LIMIT} - (\mathrm{CREDITS\_CONSUMED} + \text{current TLP credits needed})) \bmod 2^{[\mathrm{Field\ Size}]} \leq 2^{[\mathrm{Field\ Size}]} / 2$$

Even though the Transaction Scheduler determines Flow Control gating of transactions, it
does not have any knowledge of the error status of the transaction. It is not until the
Transaction Scheduler forwards the TLP to the Tx MAC that it is known if the packet is
in error. Therefore, the Tx MAC is responsible for notifying the Transaction Scheduler
that the current TLP is in error so that it does not affect CREDITS_CONSUMED.
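
The gating test can be sketched in the same way as the receiver check. The function name is hypothetical; the arguments are assumed to be integers already reduced modulo 2^[Field Size].

```python
# Hypothetical helper evaluating the transmitter gating condition.
def has_enough_credit(credit_limit: int, credits_consumed: int,
                      tlp_credits_needed: int, field_size: int) -> bool:
    """True when the current TLP may be forwarded to the Tx MAC."""
    modulus = 1 << field_size  # 2^[Field Size]
    required = credits_consumed + tlp_credits_needed
    return (credit_limit - required) % modulus <= modulus // 2


if __name__ == "__main__":
    print(has_enough_credit(20, 18, 2, 8))   # exactly enough credit
    print(has_enough_credit(20, 18, 3, 8))   # one credit short
    print(has_enough_credit(4, 250, 5, 8))   # limit has wrapped past 255
```

In the third call the advertised limit has wrapped from 260 to 4, and the modular subtraction still reports five spare credits beyond what the TLP needs.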




2.4.4 Reset

There are two types of Reset on the chip - Fundamental Reset and Hot Reset. The
following diagram will be used throughout the document to describe the devices affected
when a type of Reset is asserted at a Root Complex, a Root Port, the Switch, a
Downstream Port, or an Endpoint (I/O device) attached to a Downstream Port. Ports 1, 2,
3, 9 and 10 are all attached to root complexes and therefore each represent one OS domain (for
simplicity, the OS domain number will directly correlate to the port number, i.e. port 1 is 
assigned OS domain 1 in this example). Switch #1 has been configured such that port 4 
is shared by OS domains 1 and 3, port 5 is only accessed by OS Domain 2, port 6 is 
shared by OS domains 1, 2, and 3, and port 7 is shared by OS domains 1 and 2. 
Downstream port 4 in Switch #1 is connected to Root Port 8 in Switch #2, which enables 
access to the endpoints on Switch #2 from the Root Complexes in Switch #1. 

[Diagram not reproduced; labels include Root Complex, P1 and Endpoint.]

Figure 2.4-5: Example Topology



2.4.4.1.1 Fundamental Reset

Fundamental Reset is an auxiliary signal provided by the system to a component or
add-in card. The signal must be called PERST#. Fundamental Reset can be asserted on the
Root Complexes, the Endpoints (I/O Devices), or the Switch.



When Fundamental Reset is asserted on the Endpoints or the Switch, the behavior is 
identical to what is described in the PCI Express Base Specification - all of the Links 
attached to the device being reset will be retrained, the state machines will be initialized, 
and all TLP information will be flushed. 



If Fundamental Reset is asserted on a Root Complex, not only does the Root Complex get 
reset, but all of its downstream ports must reset as well. If its downstream ports are 
"shared" with other Root Complexes, it is important to be able to reset only the part of 
the downstream port that pertains to the Root Complex being reset, and to leave the rest 
of the downstream port logic unchanged. This is done by generating and transmitting
vendor-specific DLLPs called Reset DLLPs that inform the two components on the link
which OS Domain is getting reset (this is explained further in the following sections). If
the downstream port is not shared by Root Complexes, the Link will reset according to 
the PCI Express Base Specification. The following sections describe how the chip 
behaves when a Fundamental Reset is asserted on the various parts of the chip. 



2.4.4.1.2 Fundamental Reset initiated at the Switch 

If the Fundamental Reset on the Switch is asserted, it will propagate the Reset to all 
upstream and downstream ports. The devices attached to all the ports will be reset 
according to the PCI Express Base Specification - all of the Links attached to the device
being reset will be retrained, the state machines will be initialized, and all TLP
information will be flushed. If the fabric topology involves more than one Switch, and a
Root Port in another Switch is affected by the reset, then all of the downstream ports
assigned to that Root Port are also reset. The components highlighted in yellow in Figure
2.4-6 depict the components that are affected when Switch #1 is reset. Everything inside
Switch #1 is reset, along with the devices attached to the ports. Note that on Switch #2,
only one of the root ports is affected by the reset. This is explained in detail in
Section 2.4.4.1.3.

Figure 2.4-6: Fundamental Reset initiated at Switch #1

2.4.4.1.3 Fundamental Reset initiated at the Root Complex 

If the Fundamental Reset on a Root Complex is asserted, the reset must be propagated to 
all its downstream ports without corrupting the traffic from the other OS Domains (or 
Root Complexes). The following steps are taken when a Fundamental Reset is asserted 
at a Root Complex: 

At the Root Complex 

1. All Port Registers and State Machines must be set to their initial values as
specified in the PCI Express Base Specification.

2. The Root Complex will attempt to retrain the link.

3. Once both components on the link have entered the initial link training state, they
will proceed through Link Initialization and then through Flow Control
Initialization for VC0.

At the Switch

1. The Root Port connected to the Root Complex will retrain and initialize all its
state machines and registers.

2. The Root Port will notify the COP that all of its downstream ports need to be
reset.

3. The COP will pass the reset notification to all the downstream Ports.

4. All registers and state machines relevant to the OSD being reset must be set to
their initial values. All TLPs belonging to the OSD being reset must be flushed -
all TLPs stored in the Rx in-line buffers will naturally drain and all TLPs in the
Tx retry buffers will drain after Ack DLLPs are received. During reset, new TLPs
belonging to the OSD will be rejected. All other TLPs will be preserved.

5. The Downstream Port will be notified of the reset condition and the OSD that has
initiated the reset.

6a. If the Downstream Port is not shared (i.e. it is only accessed by one Root
Complex), the port is reset by attempting to retrain the Link.

6b. If the Downstream Port is shared, all TLPs pertaining to the OSD that has
initiated the reset must be flushed. Flow Control must also be updated to reflect
the flushing of TLPs. A vendor-specific DLLP called a Reset DLLP is generated
by the Downstream Port. The Reset DLLP contains the OSD that initiated the
reset and is transmitted on the link. A Reset DLLP is transmitted every time the
Transaction Arbiter selects the OSD that initiated the reset. Otherwise, the
Transaction Arbiter schedules TLPs to transmit on the other OSDs that are
operating normally. The Downstream Port will continue to transmit Reset DLLPs
until the reset notification from the COP has been removed.

At the Endpoint

1a. If the Endpoint is not shared (i.e. it is only accessed by one Root Complex), the
port is reset by attempting to retrain the Link.

1b. If the Endpoint is shared, it will receive the Reset DLLPs, clear all registers and
state machines, and flush all TLPs that pertain to the OSD initiating the reset.
The device will stay in this reset mode until it stops receiving Reset DLLPs. All
traffic related to the OSDs that are not being reset will operate normally.



The components highlighted in yellow in Figure 2.4-7 depict the components that are 
affected when the Root Complex attached to Root Port #1 is reset. Since downstream 
ports 4,6, and 7 are shared ports, only the logic pertaining to OS domain 1 should be 
affected by the reset. On the other hand, port 14 in Switch #2 is only accessed by the OS 
Domain being reset and can therefore reset the entire port by retraining the Link (instead 
of sending Reset DLLPs specific to an OS domain). 
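
The choice between Reset DLLPs and a Link retrain can be illustrated with the Switch #1 configuration from the example topology (port 4 shared by OS domains 1 and 3, port 5 used only by OS domain 2, port 6 shared by 1, 2 and 3, port 7 shared by 1 and 2). The sketch below is a simplified model: the dictionary, function name and action strings are hypothetical, and Switch #2 is left out.

```python
# Hypothetical model of steps 6a/6b for the downstream ports of Switch #1.
from typing import Dict, Set

# Downstream port -> OS domains that access it (from the example topology).
SWITCH1_PORT_OSDS: Dict[int, Set[int]] = {
    4: {1, 3},
    5: {2},
    6: {1, 2, 3},
    7: {1, 2},
}


def reset_actions(port_osds: Dict[int, Set[int]], osd: int) -> Dict[int, str]:
    """Return the action each affected downstream port takes for this OSD."""
    actions = {}
    for port, osds in sorted(port_osds.items()):
        if osd not in osds:
            continue  # port is untouched by this reset
        # Shared ports reset only the logic for this OSD (step 6b);
        # non-shared ports retrain the whole Link (step 6a).
        actions[port] = "reset_dllp" if len(osds) > 1 else "retrain_link"
    return actions


if __name__ == "__main__":
    print(reset_actions(SWITCH1_PORT_OSDS, 1))
    print(reset_actions(SWITCH1_PORT_OSDS, 2))
```

For OS domain 1 the affected ports are 4, 6 and 7, all shared, which agrees with the description above; for OS domain 2 the non-shared port 5 would retrain its Link instead.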

Figure 2.4-7: Fundamental Reset initiated at a Root Complex



2.4.4.1.4 Fundamental Reset initiated at an Endpoint 

If the Fundamental Reset on an endpoint device is asserted, the device will simply reset 
with its link according to the PCI Express Base Specification - all of the Links attached to 
the device being reset will be retrained, the state machines will be initialized, and all TLP
information will be flushed. All other Ports on the Switch will not be affected. 



The components highlighted in yellow in Figure 2.4-8 depict the components that are
affected when the endpoint connected to downstream port 13 is reset. It simply attempts
to retrain with the port on the other side of the Link. Any transaction being received by
the Tx MAC from the Switch Core will be discarded and will never reach the Data Link
Layer Module. The Root Complex will eventually time out when it never receives a
completion for a particular request. (We could also let the COP generate UR completions
since it will know which endpoints are in reset. The ALM could keep track of which
transactions are going to egress ports that are in reset and then route the packet to the
COP.)

Figure 2.4-8: Fundamental Reset initiated at an Endpoint

2.4.4.1.5 Hot Reset

Hot Reset is an in-band mechanism for propagating reset across a link. A Link can enter 
Hot Reset if directed by a higher layer, or if it receives two consecutive TS1 ordered sets 
with the Hot Reset bit asserted. The following sections describe how the chip behaves 
when a Hot Reset is asserted on the various parts of the chip. 

2.4.4.1.6 Hot Reset initiated at the Root Port 

A Root Port can enter Hot Reset by having its Secondary Bus Reset bit set in the Bridge
Control Register, or by receiving two consecutive TS1s with the Hot Reset bit set on the
Link. If the Hot Reset on a Root Port is initiated, the reset must be propagated to all its
downstream ports without corrupting the traffic from the other OS domains (or Root
Complexes). The following steps are taken when a Hot Reset is initiated at a Root Port:

At the Root Complex

1. The Root Complex receives a TS1 sequence with the Hot Reset bit asserted and
will attempt to retrain the Link. This can happen if the secondary bus reset bit is
set in the P2P config space.

2. All Port Registers and State Machines must be set to their initial values as
specified in the PCI Express Base Specification.

3. Once both components on the link have entered the initial link training state, they
will proceed through Link Initialization and then through Flow Control
Initialization for VC0.

At the Switch

1. The Root Port connected to the Root Complex will retrain and initialize all its
state machines and registers.

2. The Upstream Port will notify the COP that all of its downstream ports need to be
reset.

3. The COP will pass the reset notification to all the downstream Ports.

4. All registers and state machines relevant to the OSD being reset must be set to
their initial values. All TLPs belonging to the OSD being reset must be flushed -
all TLPs stored in the Rx in-line buffers will naturally drain and all TLPs in the
Tx retry buffers will drain after Ack DLLPs are received. During reset, new TLPs
belonging to the OSD will be rejected. All other TLPs will be preserved.

5. The Downstream Port will be notified of the reset condition and the OSD that has
initiated the reset.

6a. If the Downstream Port is not shared (i.e. it is only accessed by one Root
Complex), the port is reset by attempting to retrain the Link.

6b. If the Downstream Port is shared, all TLPs pertaining to the OSD that has
initiated the reset must be flushed. Flow Control must also be updated to reflect
the flushing of TLPs. A vendor-specific DLLP called a Reset DLLP is generated
by the Downstream Port. The Reset DLLP contains the OSD that initiated the
reset and is transmitted on the link. A Reset DLLP is transmitted every time the
Transaction Arbiter selects the OSD that initiated the reset. Otherwise, the
Transaction Arbiter schedules TLPs to transmit on the other OSDs that are
operating normally. The Downstream Port will continue to transmit Reset DLLPs
until the reset notification from the COP has been removed.

At the Endpoint

1a. If the Endpoint is not shared (i.e. it is only accessed by one Root Complex), the
port is reset by attempting to retrain the Link.

1b. If the Endpoint is shared, it will receive the Reset DLLPs, clear all registers and
state machines, and flush all TLPs that pertain to the OSD initiating the reset.
The device will stay in this reset mode until it stops receiving Reset DLLPs. All
traffic related to the OSDs that are not being reset will operate normally.

2.4.4.1.7 Hot Reset initiated at the COP

A Hot Reset can be initiated at the Switch through a set of Registers that can be accessed
via an I2C interface. The Hot Reset can be programmed such that it is generated on a per
port and/or per OSD basis. If the Hot Reset is propagated to Ports without distinguishing
between OSDs, the Ports will be reset according to the PCI Express Base Specification -
all of the Links attached to the device being reset will be retrained, the state machines
will be initialized, and all TLP information will be flushed. If the Hot Reset is
propagated to a subset of the OS domains on a particular Port, the Port will use Reset
DLLPs to reset the designated part of its port logic pertaining to the OS domain specified
in the Reset DLLP.

2.4.4.1.8 Hot Reset initiated at a Downstream Port

A Downstream Port can enter Hot Reset by having its Secondary Bus Reset bit set in the
Bridge Control Register. The Port will transmit Reset DLLPs on the OSDs that
correspond to the Secondary Bus Reset bits that were asserted and clear all Registers and
State Machines pertaining to the OSD. All traffic related to the OSDs that are not being
reset will operate normally.

2.4.4.1.9 Hot Reset initiated at a Shared Upstream Port

A Shared Upstream Port can enter Hot Reset by having its Secondary Bus Reset bit set in
its Bridge Control Register. The Port will transmit Reset DLLPs on the OSDs that
correspond to the Secondary Bus Reset bits that were asserted and clear all Registers and
State Machines pertaining to the OSD. All traffic related to the OSDs that are not being
reset will operate normally.

2.4.4.1.10 Hot Reset initiated at an I/O Device

If a device wants to reset itself, it can do so by either transmitting Reset DLLPs or by
transmitting TS1s. If the device wishes to only reset a particular OSD, it will generate
and transmit Reset DLLPs that specify the OSD to reset. It will also clear all Registers
and State Machines pertaining to the OSD. If the device wishes to reset the entire link, it
will generate and transmit TS1s with the Hot Reset bit asserted. It
will also initialize all Registers and State Machines and attempt to bring up the link.

2.4.5 Power Management 

PCI Express Power Management provides the following services:

• It allows software driven D-state transitions to change the Link power
management states for a physical Link.

• It provides a hardware-autonomous capability to change the Link power
management states for a physical Link (Active State Power Management).

• It provides a wakeup mechanism driven by in-band TLPs routed from the
requesting device towards the Root Complex - these are called power
management event (PME) Messages.

• It provides a means to change the Power Management state by generating PMEs
per PCI Express function.



2.4.5.1.1 Link State Power Management 

A PCI Express physical Link can enter Link power management states by software
driven D-state transitions or by active state Link power management activities. Defined
Link states include L0, L0s, L1, L2, and L3. The power savings increase as the Link
state transitions from L0 through L3. Table 2.4-1 and Table 2.4-2 summarize the Link
Power Management States for both non-shared I/O and shared I/O.

L-State | Description | Used by S/W PM? | Used by ASPM? | Clocks & Power
L0 | Fully Active Link | Yes (D0) | Yes (D0) | On
L0s | Standby State | No | Yes (D0) | On
L1 | Lower Power Standby | Yes | No | On
L2/L3 Ready | Staging Point for Power Removal | Yes | No | On
L3 | Off | n/a | n/a | Off

Table 2.4-1: Summary of Non-Shared I/O Link Power Management States

L-State | Description | Used by S/W PM? | Used by ASPM? | Clocks & Power | Shared I/O
L0 | Fully Active Link | Yes (D0) | Yes (D0) | On | Normal Operation
L0s | Standby State | No | Yes (D0) | On | TLP & DLLP transmission is prohibited for all OSDs (ASPM only)
L1-L3 | Lower Power States | No | No | On | TLP & DLLP transmission is prohibited for a specific OSD

Table 2.4-2: Summary of Shared I/O Link Power Management States

2.4.5.1.2 Power Management Software Control

One of the ways that power management states of a Link are determined is by the 
software driven D-state of its downstream component. Table 2.4-3 depicts the 
relationship between the power state of a component and its Upstream Link. A 
Downstream component can be an Endpoint or another Switch. 



Downstream Component D-State | Permissible Upstream Component D-State | Permissible Interconnect State (for Non-Shared I/O) | Permissible Interconnect State (for Shared I/O)
D0 | D0 | L0, L0s | L0, L0s
D1 (optional) | D0-D1 | L1 | L0, L0s (Cannot go into a low level power state)
D2 (optional) | D0-D2 | L1 | L0, L0s (Cannot go into a low level power state)
D3hot | D0-D3hot | L1, L2/L3 Ready | L0, L0s (Cannot go into a low level power state)
D3cold | D0-D3cold | | L0, L0s (Cannot go into a low level power state)

Table 2.4-3: Relation between Power Management States of Link and Components
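
The relationship in Table 2.4-3 lends itself to a lookup table. The sketch below encodes only the rows whose entries are legible in both interconnect columns, so D3cold is deliberately left out. The dictionary layout and function name are hypothetical.

```python
# Hypothetical lookup built from the legible rows of Table 2.4-3.
PERMISSIBLE_LINK_STATES = {
    # downstream D-state: (non-shared I/O states, shared I/O states)
    "D0": (("L0", "L0s"), ("L0", "L0s")),
    "D1": (("L1",), ("L0", "L0s")),
    "D2": (("L1",), ("L0", "L0s")),
    "D3hot": (("L1", "L2/L3 Ready"), ("L0", "L0s")),
}


def link_state_allowed(d_state: str, link_state: str, shared: bool) -> bool:
    """Check a Link state against the downstream component's D-state."""
    non_shared_states, shared_states = PERMISSIBLE_LINK_STATES[d_state]
    allowed = shared_states if shared else non_shared_states
    return link_state in allowed


if __name__ == "__main__":
    print(link_state_allowed("D1", "L1", shared=False))
    print(link_state_allowed("D1", "L1", shared=True))
```

The two calls show the key difference in the table: a shared Link stays in L0 or L0s even when one downstream D-state would otherwise permit L1.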

[Diagram not reproduced. The figure shows the Transaction (T), Data Link (D) and
Physical (P) layers of the Upstream and Downstream Components as Active or Inactive
at each step, with markers distinguishing Shared I/O from Non-Shared I/O. The step
labels recoverable from the figure are listed below.]

Upstream Component:
• Upstream component sends configuration write request
• Upstream component blocks scheduling of new TLPs
• Upstream component receives acknowledgment for last TLP
• Upstream component sends PM_Request_Ack DLLP continuously until it sees
electrical idle
• Upstream component completes L1 transition; disables DLLP, TLP transmission
and brings Phy Layer to electrical idle

Downstream Component:
• Downstream component begins L1 transition process
• Downstream component blocks scheduling of new TLPs
• Downstream component waits to receive Ack for last TLP
• PM_Enter_L1 DLLPs sent continuously
• Downstream component waits for PM_Request_Ack DLLP, acknowledging the
PM_Enter_L1
• Downstream component sees PM_Request_Ack DLLP, disables DLLP, TLP
transmission and brings Phy Layer to electrical idle

Figure 2.4-9: Entry into L1 Link State



2.4.5.1.3 Active State Power Management (ASPM)

Active State Power Management (ASPM) is an autonomous hardware based active state
mechanism that enables power saving even when the connected components are in the
D0 state (fully operational state). After a period of idle Link time, the ASPM mechanism
engages in a Physical Layer protocol that places the idle Link into a lower power state.
Once in the lower power state, transitions to the fully operative L0 state are triggered by
traffic appearing on either side of the Link. This feature may be disabled by software.

Since ASPM is initiated by the link being idle for a specified amount of time, the
physical layer can be placed in a lower power state regardless of whether the component
is shareable or not. When any traffic (regardless of OS domain) appears, the link is
placed in the fully operative L0 state.

2.4.5.1.4 Power Management Event Mechanisms



2.4.5.1.5 Link Wakeup 

PCI Express components are permitted to wakeup the system using a wakeup mechanism 
followed by a power management event (PME) Message. These PME Messages are in- 
band TLPs routed from the requesting device towards the Root Complex. This PME 
mechanism is broken up into two tasks: 

• Reactivation (wakeup) of the associated resources 

• Sending a PME Message to the Root Complex 

The Link wakeup mechanisms provide a means of signaling the platform to re-establish
power and reference clocks to the components within its domain. There are two defined
wakeup mechanisms: Beacon and WAKE#. The Beacon uses in-band signaling to
implement wakeup functionality. WAKE# is an input to the Switch, and in response to
WAKE# being asserted, the Switch must generate a Beacon that is propagated to the
Root Complex.

The Switch must translate the wakeup mechanism appropriately when some ports use the
beacon mechanism and others use WAKE#. The COP will keep a scoreboard of the
downstream ports' wakeup states, and when the downstream ports of a specific Root
Complex have been woken up, the COP will send a beacon or WAKE# to the Root
Port.

Regardless of the wakeup mechanism used, once the Link has been re-activated and
trained, the requesting agent then propagates a PM_PME message upstream to the Root
Complex.

2.4.5.1.6 PME Messages 

PCI Express devices must be notified before their reference clock and main power are
removed so that they can prepare for it. This is done as follows:

1. Before power and clocks are turned off, the Root Complex (or Downstream Port)
   sends a PME_Turn_Off message to all agents downstream to cease initiation of
   any subsequent PM_PME messages.

2. Each agent is required to respond with a PME_To_Ack TLP, which must
   terminate at the point of origin.

3. As each agent responds with a PME_To_Ack TLP, the TLP is received by the
   endpoint port and routed to the COP. When the COP receives PME_To_Ack
   TLPs for all of a particular Root Complex's downstream ports, the COP
   generates and sends a PME_To_Ack TLP to the Root Port.

4. Once an endpoint port has sent the PME_To_Ack packet, it must then prepare for
   removal of power and clocks by initiating a transition to the L2/L3 Ready state.

5. The Switch is responsible for making sure that the upstream port goes to L2/L3
   Ready state after all its downstream ports have entered L2/L3 Ready state. It
   should not wait indefinitely for the PME_To_Ack packet, but should implement
   a timeout mechanism where it would assume that the PME_Turn_Off TLP was
   received after the timeout expired.

6. The Power Delivery Manager must wait a minimum of 100 ns after the Root
   Complex has transitioned to L2/L3 Ready before removing power and clocks.
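The acknowledgment collection in steps 3 and 5 can be sketched in software terms. The following is a minimal illustration only, not the switch's actual logic: the class name, the port numbering, and the timeout value are all hypothetical, since the text does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class PmeTurnOffTracker:
    """Hypothetical scoreboard of PME_To_Ack TLPs for one Root Complex."""
    downstream_ports: frozenset
    timeout_ns: int                      # assumed value; not given in the text
    acked: set = field(default_factory=set)
    elapsed_ns: int = 0

    def on_ack(self, port: int) -> None:
        # Record an ack only from ports that belong to this Root Complex.
        if port in self.downstream_ports:
            self.acked.add(port)

    def tick(self, delta_ns: int) -> None:
        # Advance the timeout timer.
        self.elapsed_ns += delta_ns

    def ready_to_ack_upstream(self) -> bool:
        # Either every downstream port answered, or the timeout expired and
        # the switch assumes the PME_Turn_Off TLP was received.
        all_acked = self.acked == set(self.downstream_ports)
        return all_acked or self.elapsed_ns >= self.timeout_ns
```

In this sketch the COP would send its own PME_To_Ack toward the Root Port as soon as `ready_to_ack_upstream()` returns true.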




2.4.6 Switch initialization 

There are some basic events that must always happen whenever our device is initialized,
from a software/configuration point of view.

1. I2C initialization

2. Link training

3. OSD negotiation

4. System setup update (optional)

5. FC initialization

6. Pseudo-Device Discovery (optional)

7. Device Discovery

Each will be covered in more detail in the following sections.

2.4.6.1.1 I2C initialization

There will be hardware defaults for many of the registers in our chip that should provide
a functional chip at boot time. There are, however, a few structures that must be
provisioned by the I2C interface at boot time since the hardware has no idea what type of
system is being created.

There will be a default binding of OSDs to port set up by the I2C. This means the I2C
will set up the required mapping tables so that transactions that enter the switch know
where to go for the various OSDs on that port. The following structures must be
provisioned with startup values by the I2C for the switch to boot:

• Mapping table (indexed by osd[3:0]) in the MAC: set to all 0s except for
  TC0, which is hardwired to 1 on VC0. Note that there is one of these tables per
  port.

• Destination QID RAM in the Address_lookup_module (indexed by
  {dest_port[4:0],osd[3:0],tc[2:0]}): All entries for dest_port=16 should be
  provisioned to return some 5-bit number as the dest_qid for all OSD/TC
  combinations the system expects to appear. Again, the valid bit should be set for
  these entries. There is only one of these in the switch.

For example, if the EEPROM expects a 2-OSD device to be plugged into port 8, and this
device should talk to ports 0 and 1, the I2C will write the data in the Destination QID
RAM at address {5'd16,4'd0,3'd0} to a value of 0. It will also write the address at
{5'd16,4'd1,3'd0} with a value of 1. By writing these values (and setting the valid bit for
each address), initial transactions will be queued in the switch core and eventually routed
to the COP through the correct queue groups.
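To make the Destination QID RAM indexing concrete, the sketch below builds the 12-bit address from its three fields. It assumes the usual most-significant-first reading of the concatenation {dest_port[4:0],osd[3:0],tc[2:0]}; the function name and the dictionary standing in for the RAM are illustrative only.

```python
DEST_PORT_BITS, OSD_BITS, TC_BITS = 5, 4, 3


def dest_qid_address(dest_port: int, osd: int, tc: int) -> int:
    """Concatenate {dest_port[4:0], osd[3:0], tc[2:0]} into a 12-bit address."""
    if not 0 <= dest_port < (1 << DEST_PORT_BITS):
        raise ValueError("dest_port must fit in 5 bits")
    if not 0 <= osd < (1 << OSD_BITS):
        raise ValueError("osd must fit in 4 bits")
    if not 0 <= tc < (1 << TC_BITS):
        raise ValueError("tc must fit in 3 bits")
    # dest_port occupies the top bits, tc the bottom bits (assumed ordering).
    return (dest_port << (OSD_BITS + TC_BITS)) | (osd << TC_BITS) | tc


# The 2-OSD example from the text: both entries use dest_port 16 and TC 0.
qid_ram = {
    dest_qid_address(16, 0, 0): {"dest_qid": 0, "valid": True},
    dest_qid_address(16, 1, 0): {"dest_qid": 1, "valid": True},
}
```

Under this assumed bit ordering, the two addresses written by the I2C differ only in the OSD field.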

2.4.6.1.2 Link training 

This protocol should follow the base specification mechanism to allow a given port to
train as a 1x, 2x, 4x, or 8x link. Since our switch is actually composed of 8x and 4x
PCI-Express cores (an 8x core plus a 4x core will be contained in a MAC), the 8x core in each
MAC will attempt to train first. If it trains in anything less than 8x, it will "turn on" the 4x
core and allow it to attempt to train. If it trains to 8x, the 4x core will never wake up
since all 8 lanes are in use by the 8x core.
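The ordering rule between the two cores in a MAC can be expressed as a small decision function. This is only a sketch of the rule as stated above; the function name and the returned dictionary are hypothetical.

```python
def cores_after_8x_training(width_8x_core: int) -> dict:
    """Decide whether the 4x core is turned on, given the 8x core's trained width."""
    if width_8x_core not in (1, 2, 4, 8):
        raise ValueError("unsupported link width")
    # The 4x core is only allowed to train when the 8x core left lanes unused.
    return {
        "core_8x_width": width_8x_core,
        "core_4x_enabled": width_8x_core < 8,
    }
```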




2.4.6.1.3 OSD negotiation 

In order to minimize the required console management software usage, our device will
support auto-negotiation of the number of OSDs that are present on a given shared I/O
port. The EEPROM will configure the allowable number of OSDs for each shared port
and will be loaded as the default configuration of the system.

Once our switch completes link training when a new link is enabled (due to a plug-in card
being added or just coming out of reset in the system), it will begin the process of
figuring out what types of devices it is connected to on each port. A new procedure is
defined to support this, using a new DLLP that is shown here:



[Figure: OSD negotiation DLLP layout. Byte 0, offsets +0 through +3: Type (0000 0001),
PH, R, VN, R, and OSD Cnt. Byte 4: LCRC.]

Type - always set to 0000 0001 for an OSD negotiation DLLP

PH - Phase of OSD negotiation (0 for InitOSD1 DLLPs, 1 for InitOSD2 DLLPs)

R - Reserved

VN - Version number (set to 00 for base OSD negotiation)

OSD Cnt - For InitOSD1 DLLPs, the number of OSDs present in the device; for
InitOSD2 DLLPs, the negotiated number of OSDs for the link

The protocol very closely resembles the base specification's method of FC initialization.
For the "shared I/O base mode" negotiation, the VN field must be set to 00. This means
that the OSD and VC will explicitly be used for all wireline communication between two
devices.

For "shared I/O extended mode", the VN field can be set to 01. This mode
means that the RN field will be used to explicitly map traffic onto buffer resources
between the link partners. This allows a larger number of OSDs to share a buffer if
desired.




1. The OSD state machine will begin transmitting the same packet, an InitOSD1
   DLLP, over and over every clock cycle until it receives an InitOSD1 DLLP from
   the link partner's OSD state machine. At that point, the device will check the
   value of the "OSD Cnt" field received from the link partner and compare that to
   the switch's number of OSDs (which was being advertised in the "OSD Cnt" field
   of the packets it initiates). The lesser of the two numbers will be the negotiated
   number of supported OSDs for that link. If the link partner has not sent any
   InitOSD1 DLLPs when a timer expires (3 us), it is a non-shared port and our switch
   will proceed to normal flow control initialization.

2. At that point, the OSD state machine will continue sending continuous InitOSD1
   DLLPs, but now it will put the newly negotiated value in the "OSD Cnt" field.
   Once it receives a DLLP from the link partner with the same number in the "OSD
   Cnt" field, the state machine moves on to step 3.

3. The OSD state machine now transmits the same packet, except it sends it as an
   InitOSD2 DLLP by setting the PH bit in the DLLP. The fact that the state
   machine is now sending this type of DLLP means that it understood the OSD
   negotiation procedure from its link partner. A timer is also started at this point,
   and if the timer expires (3 us) before an InitOSD2 is received from the link
   partner, the state machine is reset and the process begins again. If the state
   machine receives an InitOSD2 from its link partner, OSD negotiation is complete
   and it stops transmitting InitOSD2 DLLPs.
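The three steps above amount to a "take the minimum" rule plus a small state machine. The sketch below is one possible reading, not the hardware design: the state and event names are invented for illustration, and only the 3 us timer and the min rule come from the text.

```python
from typing import Optional

OSD_TIMER_US = 3  # timer value quoted in the text


def negotiate_osd_count(local_osds: int, partner_osds: Optional[int]) -> Optional[int]:
    """Return the negotiated OSD count, or None for a non-shared link partner.

    partner_osds is the "OSD Cnt" from the partner's InitOSD1 DLLP, or None
    when no InitOSD1 arrived before the timer expired.
    """
    if partner_osds is None:
        return None  # proceed to normal flow control initialization
    return min(local_osds, partner_osds)


# Hypothetical state/event names for steps 1 to 3.
_TRANSITIONS = {
    ("ADVERTISE", "rx_init1"): "CONFIRM",        # step 1 -> step 2
    ("ADVERTISE", "timeout"): "NON_SHARED",      # partner never sent InitOSD1
    ("CONFIRM", "rx_init1_same_cnt"): "INIT2",   # step 2 -> step 3
    ("INIT2", "rx_init2"): "DONE",               # negotiation complete
    ("INIT2", "timeout"): "ADVERTISE",           # reset and begin again
}


def osd_fsm_step(state: str, event: str) -> str:
    """Advance the sketch state machine; unknown events leave the state unchanged."""
    return _TRANSITIONS.get((state, event), state)
```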



2.4.6.1.4 Shared resource initialization

Once the number of OSDs is known on a link, shared resource initialization begins if the
VN field was set to 01 during OSD negotiation. If the VN field was 00, this step is
skipped.

This mechanism allows the two devices to map multiple OSD/VCs onto a common
resource, or buffer, if desired. The results of the shared resource initialization will be
stored in a register for software to read during system setup update. If any remapping of
OSD/VCs to buffers is required, it can be done at that point.

This protocol performs the same basic steps as OSD negotiation. If a link partner does
not respond with InitRN1 DLLPs within a 3 us time interval, it does not support shared
resource initialization.





[Figure: RN initialization DLLP layout. Byte 0, offsets +0 through +3: Type (0000 0010),
PH, R, and RN Cnt. Byte 4: LCRC.]

Type - always set to 0000 0010 for an RN initialization DLLP

PH - Phase of RN initialization (0 for InitRN1 DLLPs, 1 for InitRN2 DLLPs)

R - Reserved

RN Cnt - Total number of shared buffers that can be used by the link partner

The step by step breakdown of the shared resource initialization protocol is shown below.

1. The RN state machine will begin transmitting the same packet, an InitRN1 DLLP,
   over and over every clock cycle until it receives an InitRN1 DLLP from the link
   partner's RN state machine.

2. The RN state machine now sets the PH bit so that the DLLPs are now of InitRN2
   type. The fact that the RN state machine is now sending this type of DLLP means
   that it understood the RN initialization procedure from its link partner. A timer is
   also started at this point, and if the timer expires (3 us) before an InitRN2 is
   received from the link partner, the state machine is reset and the process begins
   again. If the state machine receives an InitRN2 from its link partner, shared
   resource initialization is complete and it stops transmitting InitRN2 DLLPs.



If this step is skipped because a link partner does not support it, the routing header will
always have the RNP bit set to 0 since the RN field will not be used.

2.4.6.1.5 System setup update

At this point, the number of OSDs (and the number of buffer resources) available
in the link partner is known by the switch. The values are written to a register so that
software can see the results.

Based on the OSD negotiation that determined the allowable OSDs on that port, the
switch will write to the 16 buffer resource registers based on the results of OSD
negotiation. These registers contain some fields that specify what the encodings are
(internal to the switch) that map to a particular OSD/VC. The EEPROM will have
already set up default credit counts for both the header and data buffers.



For example, if two OSDs were negotiated, the 16 registers might look something like
this:

Buffer resource 0 Register: OSD=0, VC=0, valid=1
Buffer resource 1 Register: OSD=1, VC=0, valid=1
Buffer resource 2 Register: OSD=x, VC=x, valid=0

This means that ingressing TLPs on OSD0/VC0 will use buffer resource 0 space since the
link partner will set the RN=0 when it sets the OSD=0 in the AS header. Incoming TLPs
on OSD1/VC0 will have RN=1, OSD=1 in the ExAS header.
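The register example above can be modelled as a small lookup table. This is a sketch under stated assumptions: the register field names, the rule "one valid entry per negotiated OSD, all on VC0", and both helper functions are illustrative and not taken from the register specification.

```python
from typing import List, NamedTuple, Optional


class BufferResourceRegister(NamedTuple):
    osd: Optional[int]   # None models the "x" (don't care) entries
    vc: Optional[int]
    valid: bool


def build_registers(negotiated_osds: int, total: int = 16) -> List[BufferResourceRegister]:
    """One valid entry per negotiated OSD on VC0; remaining entries are invalid."""
    regs = []
    for rn in range(total):
        if rn < negotiated_osds:
            regs.append(BufferResourceRegister(osd=rn, vc=0, valid=True))
        else:
            regs.append(BufferResourceRegister(osd=None, vc=None, valid=False))
    return regs


def resource_for(regs: List[BufferResourceRegister], osd: int, vc: int) -> Optional[int]:
    """Return the buffer resource number (RN) matching this OSD/VC, or None."""
    for rn, reg in enumerate(regs):
        if reg.valid and reg.osd == osd and reg.vc == vc:
            return rn
    return None
```

With two negotiated OSDs, `resource_for` maps OSD0/VC0 to resource 0 and OSD1/VC0 to resource 1, matching the example.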

The hardware will now pause and query a control bit, halt_on_osd_complete, to
determine what to do next. If that control bit is set (by the EEPROM during system
boot), the hardware will wait for software to clear it before proceeding to FC initialization.
If the bit is cleared (i.e., wasn't set by the EEPROM programming), the hardware will
proceed to FC initialization immediately.

The purpose of this bit is to give the system software/console management software an
opportunity to examine the results of OSD negotiation. It can then re-provision the
buffer resources prior to FC initialization by reallocating space to different
OSDs if necessary (and possibly VCs; this particular topic is covered in a sub-section
below). Note that the hardware makes no assumptions about the "correctness" of this
reprogramming and will make no attempt to recover from invalid programming at this
point.

2.4.6.1.6 Flow Control (FC) initialization 

If only 1 OSD is present as a result of OSD negotiation, or OSD negotiation was skipped
because the link partner is a non-shared device, the state machine will begin the normal
PCI-Express base specification FC initialization procedure. However, if more than one
OSD is present, the state machine will begin "shared I/O base FC initialization". If the
shared resource initialization step was successful, the state machine will begin "shared
I/O extended FC initialization." There is also another mode called Buffer Retry
Mode that is not implemented and is explained in section 2.4.6.1.9.
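The choice between the three implemented FC initialization procedures can be summarized as a selection function. The precedence between the conditions is an assumption made for this sketch (a single-OSD or non-shared link is checked first); the function and mode names are hypothetical.

```python
from typing import Optional


def select_fc_init_mode(negotiated_osds: Optional[int],
                        shared_resource_init_ok: bool) -> str:
    """Pick an FC initialization procedure.

    negotiated_osds is None when OSD negotiation was skipped because the
    link partner is a non-shared device.
    """
    if negotiated_osds is None or negotiated_osds <= 1:
        return "base_spec"            # normal PCI-Express FC initialization
    if shared_resource_init_ok:
        return "shared_io_extended"   # RN-based credits
    return "shared_io_base"           # per-OSD/VC credits
```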



2.4.6.1.7 Shared I/O Base FC

A new DLLP is created to convey the initialization information. This DLLP is shown
here:

[Figure: DLLP layout with Byte 0 (offsets +0 through +3) and Byte 4, carrying the Type
(0111 plus the VC number), OSD, and Credit count fields.]

Figure 2.4-11: InitFC-H/InitFC-D DLLP format



Type - upper nibble set to 0111 for an FC initialization shared-link DLLP. The lower
nibble specifies the VC number.

PH - Phase of FC Initialization (0 for InitFC1 DLLPs, 1 for InitFC2 DLLPs)
TT - Transaction type (00 for Posted, 01 for Non-posted, 10 for Completions)
R - Reserved

OSD - OS Domain, ranging from 0 to 63 to specify the unique OSDs on the link

CT - Credit type (0 for header credits, 1 for data credits). This basically identifies the
DLLP as either an InitFC1_H or an InitFC1_D

Credit count - contains either the 12-bit data credit count or the 8-bit (upper 4 bits are
zeros) header credit count, based on the value of the CT bit
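Two of the field rules above are precise enough to sketch: the Type byte (upper nibble 0111, lower nibble the VC number) and the width of the credit count. The helper names are hypothetical, the VC range of 0 to 7 is an assumption based on the 8 Virtual Channels the MAC supports, and the positions of the remaining fields within the DLLP are not modelled.

```python
FC_SHARED_TYPE_NIBBLE = 0b0111


def fc_type_byte(vc: int) -> int:
    """Build the Type byte: upper nibble 0111, lower nibble the VC number."""
    if not 0 <= vc <= 7:  # assumed range: up to 8 VCs
        raise ValueError("vc must be in 0..7")
    return (FC_SHARED_TYPE_NIBBLE << 4) | vc


def credit_count_field(ct: int, credits: int) -> int:
    """Validate a credit count against the CT bit and return the 12-bit field.

    CT=0 selects header credits (8 bits, upper 4 bits zero);
    CT=1 selects data credits (12 bits).
    """
    if ct not in (0, 1):
        raise ValueError("ct must be 0 or 1")
    limit = 0xFF if ct == 0 else 0xFFF
    if not 0 <= credits <= limit:
        raise ValueError("credit count does not fit the field")
    return credits
```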
This section shows how shared-port FC initialization (where shared resource initialization
was skipped) is performed. The pattern will closely resemble the base specification
method of advertising normal FC. The state machine begins sending InitFC1_H and
InitFC1_D DLLPs. It will send them in a repeating sequence as shown below as an
example of a link that negotiated 2 OSDs. For this example, our switch was configured
to only enable VC0 on each OSD:

InitFC1_H (OSD = 0, VC = 0, Posted, header)
InitFC1_D (OSD = 0, VC = 0, Posted, data)
InitFC1_H (OSD = 0, VC = 0, Non-posted, header)
InitFC1_D (OSD = 0, VC = 0, Non-posted, data)
InitFC1_H (OSD = 0, VC = 0, Completion, header)
InitFC1_D (OSD = 0, VC = 0, Completion, data)
InitFC1_H (OSD = 1, VC = 0, Posted, header)
InitFC1_D (OSD = 1, VC = 0, Posted, data)
InitFC1_H (OSD = 1, VC = 0, Non-posted, header)
InitFC1_D (OSD = 1, VC = 0, Non-posted, data)
InitFC1_H (OSD = 1, VC = 0, Completion, header)
InitFC1_D (OSD = 1, VC = 0, Completion, data)

So for this example, 12 unique DLLPs advertise the credits for each OSD/VC
enabled. Anytime more VCs are enabled using the normal mechanism for PCI-Express,
this procedure is run in the same fashion.

Since VC0 should always be enabled, the switch should "expect" to receive the same 12
DLLPs from the link partner.

This pattern will continue until all corresponding DLLPs have been received from the
link partner. At that point, the switch will begin sending the same sequence of DLLPs,
but as InitFC2_H and InitFC2_D DLLPs (again, all 12 in a repeating
sequence). Once the switch receives an InitFC2 DLLP from its link partner, FC
initialization is complete. Note that whenever the FC1 phase is complete, TLPs can be
transmitted on the link. FC2 is just used to finally complete the handshake.

The EEPROM will contain the pre-calculated amount of credits to advertise and load
these values into our internal registers at boot time. This can result in non-optimal buffer
usage in the event the default provisioning is set up for a device that does not contain the
same number of OSDs and/or VCs. The actual hardware defaults will assume all 16
OSDs are enabled, all with VC0. As such, the credits will be equally split across all 16
OSDs for all transaction types.
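The repeating InitFC1 sequence follows a regular pattern: for every enabled OSD/VC pair, a header DLLP and a data DLLP for each of the three transaction types. The generator below reproduces that ordering as a sketch; the tuple layout and function name are illustrative choices, not part of the design.

```python
from itertools import product

TRANSACTION_TYPES = ("Posted", "Non-posted", "Completion")
CREDIT_KINDS = (("InitFC1_H", "header"), ("InitFC1_D", "data"))


def initfc1_sequence(num_osds: int, vcs=(0,)):
    """List the InitFC1 DLLPs in the order shown in the example above."""
    sequence = []
    # OSD varies slowest, then VC, then transaction type.
    for osd, vc, tt in product(range(num_osds), vcs, TRANSACTION_TYPES):
        for name, kind in CREDIT_KINDS:
            sequence.append((name, osd, vc, tt, kind))
    return sequence
```

For 2 OSDs with only VC0 enabled this yields the 12 DLLPs listed above; enabling a second VC would double the count.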




2.4.6.1.8 Shared I/O extended FC init

This section shows how shared-port FC initialization (where shared resource initialization
was not skipped) is performed. The pattern will closely resemble the base specification
method of advertising normal FC. The state machine begins sending InitFC1_H and
InitFC1_D DLLPs. It will send them in a repeating sequence as shown below as an
example of a link where we have 2 shared resources (RN = 2) as provisioned by the
EEPROM:

InitFC1_H (RN = 0, Posted, header)
InitFC1_D (RN = 0, Posted, data)
InitFC1_H (RN = 0, Non-posted, header)
InitFC1_D (RN = 0, Non-posted, data)
InitFC1_H (RN = 0, Completion, header)
InitFC1_D (RN = 0, Completion, data)
InitFC1_H (RN = 1, Posted, header)
InitFC1_D (RN = 1, Posted, data)
InitFC1_H (RN = 1, Non-posted, header)
InitFC1_D (RN = 1, Non-posted, data)
InitFC1_H (RN = 1, Completion, header)
InitFC1_D (RN = 1, Completion, data)




2.4.6.1.9 Buffer Retry Mode

In Buffer Retry Mode, FC Initialization will be performed according to the way it is
specified in the PCI Express Base Specification. If the port is shared, it will be
transparent at this stage - FC Initialization is done only on a per VC basis. The
partitioning of the buffer resources per OSD within each VC is determined during the
configuration of the device-specific registers.

In Buffer Retry Mode, the data buffer is partitioned by VC as well as by OSD, and
sets aside a "surplus" amount of memory that all types of packets can access regardless of
OSD. The partitioning of the memory for Buffer Retry Mode is shown in Figure 12 and
the variables to be programmed are shown in Table 4.




[Figure: the data buffer divided into per-port segments, including P2_TOTAL_MEM. In
default mode, there are no further partitions beyond these segments. In Buffer Retry
Mode, the memory is further partitioned into P1_OSD(0)_RSVD_MEM, P1_OSD(1)_RSVD_MEM,
..., P1_OSD(m-2)_RSVD_MEM, P1_OSD(m-1)_RSVD_MEM, and P1_VC(1)_SURPLUS_MEM, where
n = number of VCs and m = number of OSDs.]

Figure 12: Breakdown of the Data Buffer



Since Flow Control DLLPs report credits solely by VC, and the Buffer Retry Mode 
breaks the buffer down even further into OSDs per VC, a packet could be transmitted that 
had enough flow control credit, but would still not get stored in the data buffer. This 
would happen if the packet belonged to an OSD that no longer had free space in its 
section of the data buffer, even though there appears to be credit according to flow 
control. 



Instead of dropping packets that could not get stored in the data buffer due to lack of
resources for a particular OSD, the Ack/Nak DLLPs are used to inform the other end of
the link to retry the packet. According to the PCI Express Base Specification, an
Ack/Nak DLLP is generated in the receive side of the Data Link Layer and transmitted
to the other side of the link to inform the transmit side of the Data Link Layer to either
free its buffer resources that correspond to the packet (if Ack'd because there were no
data integrity errors), or retransmit the retry buffer (if Nak'd because there were data
integrity errors). If repeated attempts to transmit a TLP are unsuccessful, the transmitter
will instruct the Physical Layer to retrain the link.




The VMAC takes the Ack/Nak DLLPs to another level by allowing the Rx Transaction
Layer Module to alter the DLLP depending on whether the TLP is able to get stored in
the Data Buffer. If the TLP cannot be stored, the Ack DLLP that corresponds to the TLP
is changed to a Nak DLLP and is sent to the other side of the link. One of the
reserved bits in the Nak DLLP will also be set to differentiate between a Nak caused by a
data integrity error (which could ultimately retrain the link) and a Nak that is caused by a
buffer retry condition. This bit is shown in red in Figure 13 (Note: Using reserved bits
is acceptable since non-zero values in reserved fields are ignored and will not cause
errors with other PCI Express links).

If the TLP has enough resources in the data buffer, but is stored in the surplus segment,
one of the reserved bits in the Ack DLLP will be set to notify the other side that the OSD
corresponding to that particular TLP is reaching buffer saturation. This bit will be used
by the Transaction Arbiter in the Tx Presentation Module. The Transaction Arbiter will
skip the Transaction Queue that corresponds to the particular VC/OSD group on the
proximate turn. This will give the Rx data buffer some time to free up its resources.
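The receive-side decision described above has four outcomes. The sketch below enumerates them; it is an illustration of the described behavior only. The byte-based space accounting, the function name, and the single `reserved_flag` field (standing for whichever reserved bit is repurposed in the Ack or Nak) are assumptions.

```python
from typing import NamedTuple


class LinkResponse(NamedTuple):
    dllp: str            # "Ack" or "Nak"
    reserved_flag: bool  # the repurposed reserved bit


def respond_to_tlp(integrity_ok: bool, tlp_bytes: int,
                   osd_free_bytes: int, surplus_free_bytes: int) -> LinkResponse:
    """Choose the DLLP returned for a received TLP in Buffer Retry Mode."""
    if not integrity_ok:
        # Ordinary Nak for a data integrity error.
        return LinkResponse("Nak", False)
    if tlp_bytes <= osd_free_bytes:
        # Fits in the OSD's reserved space: plain Ack.
        return LinkResponse("Ack", False)
    if tlp_bytes <= surplus_free_bytes:
        # Stored in the surplus segment: Ack, but warn of saturation.
        return LinkResponse("Ack", True)
    # No room anywhere: Nak flagged as a buffer retry condition.
    return LinkResponse("Nak", True)
```

A transmitter following this scheme would treat a flagged Ack as a hint to skip that VC/OSD queue on its next arbitration turn, and a flagged Nak as a retry request that should not count toward link retraining.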



Programmable Variable | Mode* | Width | Description
P1_TOTAL_MEM | Default & Buffer Retry | | 00 = 32KB total memory allocated to Port 1; 01 = 64KB total memory allocated to Port 1; 10 = 96KB total memory allocated to Port 1; 11 = 128KB total memory allocated to Port 1
P1_VC(n)_MEM | Default & Buffer Retry | 6 bits | 0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB where ...
P1_VC(n)_SURPLUS_MEM | Buffer Retry Only | 3 bits | 0x0 = 16KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x7 = 14KB where P1_VC(n)_SURPLUS_MEM < P1_VC(n)_MEM
P1_OSD(m)_RSVD_MEM | Buffer Retry Only | 6 bits | 0x0 = 128KB, 0x1 = 2KB, ... where ...
P2_TOTAL_MEM | Default & Buffer Retry | N/A | P2_TOTAL_MEM = 128KB - P1_TOTAL_MEM
P2_VC(n)_MEM | Default & Buffer Retry | 6 bits | 0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB where ...
P2_VC(n)_SURPLUS_MEM | Buffer Retry Only | 3 bits | 0x0 = 16KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x7 = 14KB where P2_VC(n)_SURPLUS_MEM < P2_VC(n)_MEM
P2_OSD(m)_RSVD_MEM | Buffer Retry Only | 6 bits | 0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB where ...

Table 4: Programmable Variables of the Data Buffer
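The encodings in Table 4 follow a simple pattern: the all-zeros code selects the maximum size, and every other code counts in 2KB steps. The decoders below are a sketch of that reading. The function names are hypothetical, and the relation P2_TOTAL_MEM = 128KB - P1_TOTAL_MEM is taken from a partly illegible table row, so treat it as an assumption.

```python
def total_mem_kb(code: int) -> int:
    """2-bit Pn_TOTAL_MEM code: 00 -> 32KB, 01 -> 64KB, 10 -> 96KB, 11 -> 128KB."""
    if not 0 <= code <= 0b11:
        raise ValueError("code must fit in 2 bits")
    return 32 * (code + 1)


def vc_mem_kb(code: int) -> int:
    """6-bit code: 0x0 -> 128KB, otherwise 2KB per step (0x3F -> 126KB)."""
    if not 0 <= code <= 0x3F:
        raise ValueError("code must fit in 6 bits")
    return 128 if code == 0 else 2 * code


def surplus_mem_kb(code: int) -> int:
    """3-bit code: 0x0 -> 16KB, otherwise 2KB per step (0x7 -> 14KB)."""
    if not 0 <= code <= 0x7:
        raise ValueError("code must fit in 3 bits")
    return 16 if code == 0 else 2 * code


def p2_total_mem_kb(p1_code: int) -> int:
    """Assumed relation: Port 2 receives whatever Port 1 leaves of the 128KB."""
    return 128 - total_mem_kb(p1_code)
```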



[Figure: Ack/Nak DLLP layout, offsets +0 through +3: type field (0000 0000 = Ack,
0001 0000 = Nak), Reserved bits, and AckNak_Seq_Num, followed by the 16-bit CRC.]

Figure 13: DLLP Format for Ack/Nak Packets





2.4.6.1.10 Pseudo-device discovery

Once FC init is done, the devices are ready to exchange TLPs, so device discovery
can proceed. This is another optional checkpoint at which we could implement another
halt bit, halt_after_fc_init. If the I2C set this bit at boot time, the hardware will now
pause again. Console management software can now come in using the I2C bus and use
the COP to emulate device discovery by acting like a Root Complex. It could send Type0
and Type1 config cycles to the newly plugged-in I/O device and figure out what VCs,
buffers, and OSDs that device has available. It would load this information into local
memory and essentially restart the whole process again for that I/O device.

This is an advanced feature that gives software the hooks it would need to set up optimal
resources across VCs and buffers before real device discovery happens from the OS. So
the steps would be as follows:

1. I2C initialization

2. Link training

3. OSD negotiation

4. FC initialization

5. "Pseudo" Device Discovery

6. Load results into local memory of console processor

7. Link training

8. OSD negotiation

9. System setup update (use newly calculated optimal buffer settings)

10. FC initialization

11. OS Device Discovery

2.4.6.1.11 Discovery

The discovery mechanism for shared devices uses the standard PCI-Express mechanism
for discovery per OS domain. This mechanism uses Type0 and Type1 configuration
cycles to determine which devices, if any, are present behind a link. Each one of these
cycles is expected within the switch for each OS domain.

A root complex will begin sending Type0 CFG cycles to its southbound PCI-Express
port. It will discover the switch port directly connected to the switch. It will discover
the switch as a PCI-Bridge, with no other devices on that link. It will then initiate Type1
CFG cycles to discover the devices behind that PCI Bridge on that port. Once inside the
switch, the Type1 CFG cycles will discover all ports and PCI Bridges that have been
assigned to that OS domain. A shared port will appear as assigned to that OS domain.

When a root complex discovers a shared port, it has no knowledge that the port is shared.
As such, the port will appear to that root complex as a PCI-Bridge, the same as all other
ports. In the initial switch implementation, there will be 16 OS domains possible. Each
root complex will be mapped to one of those OS domains. This mapping will have a
specific AS turn pool encoding as a 16 port switch. From the switch view, it will always
have a 16 port switch as its link partner.

For CFG cycles sent to a shared link, the CFG cycle will be encapsulated within the AS
PI-8. This will be sent with the turn pool encoding assigned to the appropriate OS
domain. The response to that CFG cycle will depend on the number of real OS domains 
supported by that link partner. For a given shared link partner, it will support a certain 
number of OS domains. The shared link partner will only respond to CFG cycles which 
are mapped to OS domains that it supports. 

In the following diagram, there is a 4 OS domain shared controller tied to the switch.
The switch supports 16 OS domains, enumerated as a 16 port virtual AS switch. The
shared controller supports 4 OS domains, enumerated as a 4 port virtual AS switch. The
switch always sends packets encapsulated to a 16 port virtual AS switch. The controller
always sends packets encapsulated to a 4 port virtual AS switch. If a port ever sees
packets encapsulated with a turn pool beyond the range it supports, the packets are not
responded to at the transaction layer.



This method allows standard PCI discovery to work. In the example below, each OS
domain maps to a given virtual AS switch port. Root Complex 1 initiates a Type1 CFG
cycle destined for the shared controller. The switch changes the Type1 to a Type0 cycle
at Port 10 prior to sending the CFG cycle to the controller. This CFG cycle is
encapsulated to a single virtual AS port, and the controller responds by sending the
response on the corresponding return virtual AS port.

If a Root Complex were to exist that was mapped to a virtual AS port that the controller
does not have, e.g. virtual AS port 10, the controller would drop the packet. At the
transport level, this would be the equivalent of a timeout on the CFG read, and the Root
Complex would assume no I/O device is present. This mechanism allows all devices to be
discovered by the same method, with no device required to know the full fabric topology.



[Figure: Root Complexes 1 through 4, each sharing the 10G NIC, attach to switch ports
(Port 5 and Port 6 are labeled); Switch Port 10 connects to a 4 OS Domain Sharable 10G NIC.]

Figure 2.4-14: Example of a Shared Switch
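The drop behavior that makes standard discovery work can be sketched from the controller's side. This is a simplified model, not the device's implementation: virtual AS ports are assumed to be numbered from 0, and the function name and return value are hypothetical.

```python
from typing import Optional


def controller_cfg_response(virtual_port: int, supported_osds: int) -> Optional[str]:
    """Model a shared controller's reaction to an encapsulated CFG cycle.

    Returns a description of the completion, or None when the packet is
    dropped because its virtual AS port is outside the supported range.
    """
    if 0 <= virtual_port < supported_osds:
        # Respond on the corresponding return virtual AS port.
        return f"completion on virtual AS port {virtual_port}"
    # Dropped: the Root Complex eventually sees a timeout on the CFG read
    # and concludes that no I/O device is present.
    return None
```

For the 4 OS domain NIC in the figure, a CFG cycle arriving on virtual AS port 10 returns `None`, which corresponds to the timeout case described in the text.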
2.5 Error Handling

Two aspects of a PCI Express switch make its error handling behavior much different 
than that of a PCI Bridge: 

• First, transient link errors can typically be corrected automatically by hardware in 
the Data Link Layer. This eliminates the need to report these errors by setting any 
bits in the PCI status registers and eliminates any problems produced when the 
failed packet was being forwarded on behalf of another device. This is because 
the packet will ultimately be transmitted correctly, so no error handling 
procedures beyond the scope of a single link need to be considered. Repeated 
failures will ultimately cause a fatal error to be recorded and the^^lty link will 
be shut down. 

• Second, non-posted transactions are managed with a completion message that
  eliminates the need for a forwarding agent to keep track of expected response
  messages. So a switch can blindly forward transactions without determining the
  message type or worrying about mapping errors back to the originating master.

When a Data Link Layer or Physical Layer error occurs which cannot be associated with a
particular packet or OSD, the error is logged separately for each OSD and error messages,
if any, are sent to each port sharing the port with an error.

The Nexis Switch implements the PCI Express Advanced Error Reporting Capability
registers in addition to the PCI Express baseline error reporting registers and the legacy PCI
mapped error registers. The Advanced Error Reporting Capability registers provide
detailed error logging, error masking and error severity control registers. They also provide
a header logging register for the first unmasked error that is logged. Refer to the
Configuration Registers section for a detailed description of the Advanced Error Reporting
Capability registers.



2.5.1 Error Types 

PCI Express errors are classified as Correctable and Uncorrectable errors. Uncorrectable
errors are further classified as Fatal or Non-Fatal errors. Errors are also classified based
on the source of the error as Transaction Layer Errors, Data Link Layer Errors and
Physical Layer Errors. The following sections describe each of the errors and how they
are handled in the Nexis Switch.




2.5.1.1 Correctable Errors

Correctable errors are those which are localized to a single PCI Express link and can be
automatically corrected by hardware. All correctable errors are automatically corrected
by a retransmission of the faulty packet. An ERR_COR message reporting the occurrence
of the error may optionally be sent to the root complex. The message is sent only if the
error is not masked and the SERR Enable bit is set in the Command Register and Bridge
Control register.

2.5.1.1.1 Physical Layer Errors
• Receiver Error 
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Physical Layer receivers may optionally check for errors and report them by sending 
an ERR_COR message to the root complex. Any DLLP or TLP being received that is 
in error should be discarded and any storage allocated made available. The error will 
be automatically corrected when a NAK DLLP is received and the packet is re- 
transmitted. 

The Physical Layer will check for disparity errors and invalid symbols. If any of 
these errors occur, the Physical Layer will report it. If the error (either one) is part of 
a valid TLP, the Rx Link Layer will send a NAK DLLP for the corresponding TLP 
and will not report an error since it was already done by the Physical Layer. If the 
error is detected in the Data Link Layer in time (within 3 clock cyclesjrom the start 
of packet), the packet will be purged and the Rx Transaction Layer wilT^ver sekthe 
packet. If the error is not detected in time, the TLP will be forwarded to 
Transaction Layer with an error indication at the end of the packJp^* 
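
The purge-or-forward decision for a TLP hit by a Physical Layer error can be sketched as below. This is only an illustration of the rule stated above, not the switch's implementation: the function name, the enum, and the `cycles_since_sop` argument are hypothetical, and only the 3-clock-cycle window comes from the text.

```python
from enum import Enum

# Window, in clock cycles from start of packet, within which the Data Link
# Layer can still purge the TLP (value taken from the text above).
PURGE_WINDOW_CYCLES = 3


class RxAction(Enum):
    PURGE = "purge"                      # Rx Transaction Layer never sees the packet
    FORWARD_WITH_ERROR = "fwd_error"     # forwarded, error flagged at end of packet


def rx_phy_error_action(cycles_since_sop: int) -> RxAction:
    """Hypothetical helper: decide how a TLP with a disparity or
    invalid-symbol error is handled, based on when the error is detected."""
    if cycles_since_sop <= PURGE_WINDOW_CYCLES:
        return RxAction.PURGE
    return RxAction.FORWARD_WITH_ERROR
```

In both cases a NAK DLLP is still scheduled for the TLP, so the link partner retransmits it.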



2.5.1.1.2 Data Link Layer Errors 

• Bad TLP

This error is set when the link layer detects one of the following.

o Bad CRC

o Incorrectly nullified packet (TLP ends with EDB but the LCRC is not
inverted)

o Incorrect packet sequence number (and not duplicate)

• Bad DLLP
This error occurs when a CRC check fails on a DLLP.
• Replay Timer Timeout
This error occurs when the REPLAY_TIMER has been exceeded by a given TLP, which
occurs when no ACK or NAK DLLP is received within that time period. This error is
automatically corrected by NAKing the TLP and forcing a re-transmission.
• REPLAY_NUM Rollover
This error occurs when a given TLP was unsuccessfully retransmitted REPLAY_NUM
times. This condition is automatically corrected by signaling the Physical Layer to
retrain the link. Once retraining is successful, the TLP can again be retransmitted (and
REPLAY_NUM reset).
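
The interaction between the replay timer and the replay counter can be modelled with a small state machine. This is a sketch under stated assumptions: the class name and method are hypothetical, and the limit of 4 attempts is an assumed parameter (the text only says the TLP is retransmitted REPLAY_NUM times before the link is retrained).

```python
class ReplayState:
    """Toy model of Tx replay bookkeeping; not the switch's actual logic."""

    def __init__(self, replay_num_limit: int = 4):
        # replay_num_limit is an assumption made for this sketch.
        self.replay_num_limit = replay_num_limit
        self.replay_num = 0

    def on_replay_timer_timeout(self) -> str:
        """No ACK or NAK arrived in time: replay, or retrain on rollover."""
        self.replay_num += 1
        if self.replay_num >= self.replay_num_limit:
            # Rollover: the Physical Layer retrains the link, after which
            # the counter is reset and the TLP may be retransmitted again.
            self.replay_num = 0
            return "retrain_link"
        return "replay_tlp"
```

Both outcomes are correctable errors, so each would be reported (if unmasked) with an ERR_COR message.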


2.5.1.2 Uncorrectable Errors

Uncorrectable Errors are those which disrupt the functionality of the PCI Express port but
cannot be corrected by hardware. Using the Advanced Error Reporting Capability
Registers each uncorrectable error can be configured to be sent with an ERR_FATAL or
ERR_NONFATAL message to the root complex or can be masked off from sending a
message. The error messages are sent only if the SERR Enable bit is set in both the
Command Register and Bridge Control Register.
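
The gating rule for uncorrectable error messages reduces to a small predicate. The sketch below is illustrative only; the function and argument names are hypothetical, and the inputs stand for the mask bit, the severity bit, and the two SERR Enable bits named in the text.

```python
from typing import Optional


def uncorrectable_error_message(masked: bool,
                                severity_fatal: bool,
                                serr_en_command: bool,
                                serr_en_bridge_ctl: bool) -> Optional[str]:
    """Return the message to send to the root complex, or None if
    no message is sent for this uncorrectable error."""
    if masked:
        return None                      # masked off in the error mask register
    if not (serr_en_command and serr_en_bridge_ctl):
        return None                      # SERR Enable must be set in both registers
    return "ERR_FATAL" if severity_fatal else "ERR_NONFATAL"
```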



2.5.1.2.1 Physical Layer Errors
• Training Error

This error occurs when a device fails to establish a link with its partner. An error 
message is sent to the root complex if enabled, and the link is taken down. 

2.5.1.2.2 Data Link Layer Errors

• Data Link Layer Protocol Error 

This error is caused when the TX MAC receives an ACK/NAK DLLP with a 
sequence number that does not correspond to any of the packets in the retry buffer. 

2.5.1.2.3 Transaction Layer Errors

• Unsupported Request

An Unsupported Request Error is generated for the following conditions.

o A request cannot be mapped to any address space mapped to a particular device or
to any egress ports.

o Downstream port of the switch receives a configuration request with Device
number 1-31. The port will terminate the transaction and not issue it on the
link.

o A packet is forwarded to an egress port to be transmitted across the link, but
the link is in DL_Down state.

• ECRC Error

Logging this error is optional. The Nexsis switch will not verify the ECRC for forwarded
packets. All transactions originating from the switch (COP) will not contain ECRC. All
transactions destined to the switch (Type1/Type0 headers and Device specific registers)
that have ECRC will not be checked. The ECRC Generation Capable and ECRC Check
Capable bits in the Advanced Error Capability register are hardwired to zero.



• Malformed TLP

The receiver of TLP packets generates this event when an inconsistency in the
formation of a packet is detected at the receiver (destination). There are several
conditions that require detection and reporting, some others are optional. The table
below shows the conditions that are supported and not supported by the switch.



Malformed Packet Errors | Supported?
Data payload exceeds max payload size | Yes
Actual data length does not match data length specified in the header | Yes
Start memory DW address crossing a 4KB boundary | No
TD field = 1 but no ECRC | No
Byte enable violation detected | Yes, only for writes to COP
Packets with undefined type field | Yes
Multiple completions with read data that violate RCB | No
Completions with a configuration request retry status in response to a request other than a configuration request | No (is this optional?)
TC contains a value not assigned to an enabled VC within the TC/VC mapping for the receiving device |
Transaction type requiring use of TC0 has TC value other than zero | No (maybe V2)
Routing is incorrect for transaction type (e.g. transactions requiring routing to RC detected moving away from RC) | No (maybe V2) (this is an ALM check)
Msg/MsgD messages with 000b routing if received at upstream port | ?? (ALM check)
Msg/MsgD messages with 011b routing received at downstream port | ?? (ALM check)



A malformed TLP will be discarded and an ERR_NONFATAL or ERR_FATAL
message may be sent to the root complex. No Nak DLLP is sent in response to a
Malformed TLP, and the flow control credits are not updated. A Completion Response
is not sent for non-posted transactions with a malformed TLP.
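
As an illustration, the three conditions the table marks as unconditionally supported could be checked as follows. This is a minimal sketch, not the MAC's implementation; the function name and its arguments (lengths in DWORDs, max payload in bytes, a pre-decoded `type_is_defined` flag) are assumptions made for the example.

```python
from typing import List


def malformed_tlp_reasons(header_length_dw: int,
                          actual_payload_dw: int,
                          max_payload_bytes: int,
                          type_is_defined: bool) -> List[str]:
    """Return the list of supported malformed-TLP conditions that a
    received TLP violates (an empty list means none of them)."""
    reasons = []
    if actual_payload_dw * 4 > max_payload_bytes:
        reasons.append("payload exceeds max payload size")
    if actual_payload_dw != header_length_dw:
        reasons.append("length mismatch")
    if not type_is_defined:
        reasons.append("undefined type field")
    return reasons
```

A non-empty result corresponds to the handling described above: discard the TLP, send no Nak, and leave the flow control credits unchanged.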



• Receiver Overflow

A Receiver may optionally check for Receiver Overflow errors (TLPs exceeding
CREDITS_ALLOCATED). If this condition is detected, TLP(s) are discarded
without modifying the CREDITS_RECEIVED and any resources that had been
allocated for the TLP(s) are de-allocated. (Not supported right now in the FPGA)

• Completion Timeout

When a non-posted transaction fails to return a completion message within the
prescribed time limit, then a completion timeout error has occurred. The Nexsis
switch will not master any non-posted transactions, and so will never generate a
Completion Timeout error for any packets going off-chip.

• Completer Abort

This error occurs when the Completer of a request is unable to process the request 
due to a component-specific error condition. There are no known conditions for 
which the switch will generate this error. 

• Unexpected Completion 

This error occurs when a completion message is received that cannot be matched to 
any outstanding requests. The switch will report this error only for the completion 
targeted to the switch as an endpoint. The ALM detects this condition and notifies the 
MAC to log the error. 

• Flow Control (FC) Protocol Error 

The receiver of Flow Control DLLP packets generates this event when a violation of
the flow control protocol is detected at the receiver (destination). There are several
conditions that require detection and reporting, some others are optional. The
following conditions may be checked and violations reported.

o During FC initialization for any Virtual Channel the VC must advertise credits
equal to or greater than the minimum for that FC type.

o If an Infinite Credit advertisement (value of 0) has been made during
initialization, then any future update credit values must be set to zero.

• Poisoned TLP

A poisoned TLP is one where the EP field in the header is set indicating that the TLP
is known to contain an error. If the packet is already poisoned the switch will not
issue an error message unless it is the final target of the poisoned TLP.

If the error is an uncorrectable ECC error in the internal data buffers of the switch for
one of the transit packets the switch sets the EP bit in the header, logs the error and
forwards the packet.

When a switch forwards a poisoned TLP, the receiving side must set its Detected
Parity Error bit and the transmitting side must set its Master Data Parity Error bit if
the Parity Error Response bit in the Bridge Control register is set.




2.5.1.3 Cut-Through Error Handling

If the cut-through enable bit is asserted in the Nexsis Switch, a TLP could be forwarded
from the ingress port to the egress port before it has been completely received on the
ingress port. This complicates the error handling in the conditions where the TLP would
otherwise be discarded. If an error does occur on a cut-through packet after it has begun
transmission out an egress port, then the TLP must be 'nullified' to indicate to the
receiving device that an error has occurred. A TLP is nullified by using the
inverted value for its LCRC and by signaling the physical layer that it must use an EDB
symbol instead of an END symbol as the final framing symbol. The ingress port returns
a NAK DLLP to the TLP source and the egress port purges the packet from the Replay
buffer. When the endpoint finally receives the TLP and detects the EDB symbol and the
inverted CRC, it purges the packet and does not return a NAK DLLP.
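
A nullified TLP and the receiver's reaction to it can be sketched as follows. The sketch assumes a nullified TLP carries both markers (EDB framing and inverted LCRC), as the receiver behaviour above implies. The helper names are hypothetical, and `zlib.crc32` is only a stand-in 32-bit checksum: it is not the LCRC algorithm defined by PCI Express.

```python
import zlib

END, EDB = "END", "EDB"   # symbolic names for the final framing symbol


def lcrc32(tlp: bytes) -> int:
    # Stand-in checksum (assumption): NOT the real PCI Express LCRC.
    return zlib.crc32(tlp) & 0xFFFFFFFF


def frame(tlp: bytes, nullify: bool = False):
    """Return (payload, lcrc, end_symbol) as an egress port would emit it."""
    crc = lcrc32(tlp)
    if nullify:
        return tlp, crc ^ 0xFFFFFFFF, EDB   # inverted LCRC plus EDB
    return tlp, crc, END


def receive(tlp: bytes, lcrc: int, end_symbol: str) -> str:
    """Classify a received frame: accept, silently discard, or NAK."""
    good = lcrc32(tlp)
    if end_symbol == EDB and lcrc == good ^ 0xFFFFFFFF:
        return "discard_no_nak"             # correctly nullified TLP
    if end_symbol == END and lcrc == good:
        return "accept"
    return "bad_tlp_nak"                    # bad CRC or incorrectly nullified
```

The third outcome covers the Bad TLP case listed earlier, where a TLP ends with EDB but its LCRC is not inverted.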

2.5.2 PCI 2.3 Error Reporting 

All PCI Express ports will update the error status bits in the PCI 2.3 configuration space
as appropriate in order to maintain compatibility with legacy drivers. Note that these
status bits are independent of any status maintained in the Advanced Error Reporting
Status or control registers. In particular, setting or clearing an Advanced Error Reporting
Status bit should not clear the corresponding bit in the PCI 2.3 configuration registers -
these must be left to be explicitly managed by software, which is performed by the COP.

Each PCI Express port connects implicitly to a PCI-PCI virtual bridge. Each of these
bridges will implement a complete and independent PCI-PCI bridge header configuration
space.

Note that the primary bus of each bridge is the one closest to the Root Complex, and that
this can vary depending on how the Nexsis Switch is configured in a system.

The following sections detail how the Nexsis Switch should report errors to remain
compatible with legacy PCI 2.3 software.

2.5.2.1 Primary side of P2P Bridge

• Detected Parity Error

This error will be set whenever the primary side of the internal P2P bridge receives a
poisoned TLP. In the Nexsis switch the RX ingress MAC (Root Port) would set the
Detected Parity Error bit in the Primary Status Register when receiving a poisoned
TLP.

• Signaled System Error

The TX MAC of the upstream port must set the Signaled System Error if an
uncorrectable error message (ERR_FATAL or ERR_NONFATAL) is transmitted.

• Received Master Abort

This error needs to be detected only by PCI Express device which originally initiates
a transaction and hence not applicable to the Nexsis switch.

• Received Target Abort

This error needs to be detected only by PCI Express device which originally initiates
a transaction and hence not applicable to the Nexsis switch.

• Signaled Target Abort

Hardwired to 0 since we will not abort any transactions as an endpoint.

• Master Data Parity Error

This error is detected when forwarding a Poisoned TLP from the secondary side of
the bridge to the primary side. In the Nexsis switch the TX MAC of the upstream port
must set the Master Data Parity Error bit in the Primary Status register if the Parity
Error Response bit in the Command Register is set.

2.5.2.2 Secondary side of P2P Bridge

• Detected Parity Error 

This bit is set when the secondary side of P2P bridge receives a poisoned TLP. In the
Nexsis switch the Rx MAC of the downstream port must set the Detected Parity Error
bit in the Secondary Status register when it receives a poisoned TLP.

• Received System Error

The Rx MAC of a downstream port must set the Received System Error if an
uncorrectable error message (ERR_FATAL or ERR_NONFATAL) is received.

• Received Master Abort

This error needs to be detected only by PCI Express device which originally initiates
a transaction and hence not applicable to the Nexsis switch.

• Received Target Abort

This error needs to be detected only by PCI Express device which originally initiates
a transaction and hence not applicable to the Nexsis switch.

• Signaled Target Abort

There are no known conditions for which the switch will Target Abort a
transaction targeted to it.

• Master Data Parity Error

This error is detected when forwarding a Poisoned TLP from Primary to Secondary.
In the Nexsis switch the TX downstream MAC would set the Master Data Parity Error
bit in the Secondary Status register when transmitting a poisoned TLP.

2.5.3 Error Reporting

Errors may be reported by either generating an explicit error message, or through the
Completion Status field in its completion header. A completion response is used for
reporting errors resulting from a non-posted request, while explicit error messages are
used for all types of messages. Note that a Completion Status may only be used by
the intended target of the originating message. Thus, a non-posted message that is being
routed through the switch will never have a Completion generated by the switch, and so
errors reported for that message will use explicit messages. The only messages that
use a Completion response to report errors will be those messages that are targeted
to the registers of the switch itself.

Note also that only Unsupported Request and Completer Abort errors are reported in a 
completion response. All other errors, even for non-posted requests, will generate an 
explicit error message. 

2.5.3.1.1 Completion Status Response 

The format of a completion header used to respond to an error condition is as follows: 




Byte offsets +0 to +3, bits 7 (left) to 0 (right) within each byte:

DW 0: R | Fmt = 0 1 | Type = 10000 | R | TC | Reserved | TD | EP | Attr = 0 0 | R | Length
DW 1: Completer ID | Compl. Status | BCM | Byte Count
DW 2: Requester ID | Tag | R | Lower Address



Figure 2.5-1: Format of Completion Header 



The following table shows the fields and their values for completion headers reporting
error conditions:

Field | Bits | Description | Value | Variable
Length | 9:0 | Always zero for error completions | 0 | No
R | 11:10 | reserved | | No
Attr | 13:12 | Copied from request header | |
EP | 14 | Indicates TLP is poisoned | | No
TD | 15 | Indicates presence of TLP digest | 0 | No
Reserved | 19:16 | reserved | | No
TC | 22:20 | Copied from request header | | Yes
R | 23 | reserved | 0 | No
Type | 28:24 | Indicates Msg type | 10000 | No
Fmt | 30:29 | Indicates header format, no data | 0b01 | No
Byte Count | 43:32 | The remaining byte count for the request | | Yes
BCM | 44 | | 0 | No
Compl. Status | 47:45 | 000 = Successful Completion; 001 = Unsupported Request; 010 = Configuration Request Retry Status; 100 = Completer Abort | | Yes
Completer ID | 63:48 | Bus #, device # and function # of unit generating completion (and reporting error) | | Yes
Lower Address | 70:64 | Unused | 0 | No
R | 71 | reserved | 0 | No
Tag | 79:72 | Copied from request header | | Yes
Requester ID | 95:80 | Copied from request header | | Yes


Table 2.5-1 Completion Header Fields 
2.5.3.1.2 Explicit Error Messages 

Explicit error messages are generated for every correctable or uncorrectable error which
is not masked in the Advanced Error Capabilities register. For shared ports, an error
message for an error from a port which cannot be associated specifically to an OSD (for
example, Training Error or Replay Timer Timeout) should be sent to all the RCs sharing
the downstream switch port.



Error messages generated are one of the following three types: 



Type | Description
ERR_COR | Issued when the component or device detects a correctable error on the PCI Express interface
ERR_NONFATAL | Issued when the component or device detects a Non-fatal, uncorrectable error on the PCI Express interface
ERR_FATAL | Issued when the component or device detects a Fatal, uncorrectable error on the PCI Express interface

Table 2.5-2: PCI Express Error Messages

The format of all error messages is shown in the following figure:




Byte offsets +0 to +3, bits 7 (left) to 0 (right) within each byte:

DW 0: R | Fmt = 0 1 | Type = 10000 | R | TC = 000 | Reserved | TD | EP | Attr = 0 0 | R | Length = 0
DW 1: Requester ID | Tag = 0 | Message Code



Figure 2.5-2: Format of Error Messages 
All error messages are 16-bytes in length, with the following fields and values: 



Field | Bits | Description | Value | Variable
Length | 9:0 | Unused - always zero | 0 | No
R | 11:10 | reserved | 0 | No
Attr | 13:12 | Attributes - always zero | 0b00 | No
EP | 14 | Indicates TLP is poisoned | 0 | No
TD | 15 | Indicates presence of TLP digest | 0 | No
Reserved | 19:16 | reserved | 0 | No
TC | 22:20 | Traffic class, must be zero | 0 | No
R | 23 | reserved | 0 | No
Type | 28:24 | Indicates 'Msg' type | 10000 | No
Fmt | 30:29 | Indicates 4 DW header, no data | 0b01 | No
R | 31 | reserved | 0 | No
Message Code | 39:32 | 0011 0000 = ERR_COR; 0011 0001 = ERR_NONFATAL; 0011 0011 = ERR_FATAL | | Yes
Tag | 47:40 | Unused (no completion required) | 0 |
Requester ID | 63:48 | Bus #, device # and function # of unit reporting error | |


Table 2.5-3 Error Message Fields 
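
For illustration, the first two DWORDs of an error message can be assembled directly from the bit positions in Table 2.5-3. This is a sketch, not the switch's message generator: the function name is hypothetical, and the bit numbering is the table's own (bit 0 is the least significant bit of Length), which is not necessarily the order in which bytes appear on the wire.

```python
MESSAGE_CODES = {
    "ERR_COR": 0b0011_0000,
    "ERR_NONFATAL": 0b0011_0001,
    "ERR_FATAL": 0b0011_0011,
}


def pack_error_message(kind: str, requester_id: int) -> int:
    """Pack bits 63:0 of an error message using the table's bit positions."""
    if not 0 <= requester_id <= 0xFFFF:
        raise ValueError("requester_id must fit in 16 bits")
    value = 0
    value |= 0b10000 << 24               # Type, bits 28:24: 'Msg' type
    value |= 0b01 << 29                  # Fmt, bits 30:29: 4 DW header, no data
    value |= MESSAGE_CODES[kind] << 32   # Message Code, bits 39:32
    value |= requester_id << 48          # Requester ID, bits 63:48
    # Length, Attr, EP, TD, TC and Tag are all zero for error messages.
    return value
```

For example, `pack_error_message("ERR_FATAL", 0x0100)` sets only the Type, Fmt, Message Code, and Requester ID fields; every other field stays zero.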



The PCI Express spec requires all error messages to use
a 4 DW header. This Error Message field is taken from the Header Log Register where the header
of the first packet in error is stored.

2.5.3.1.3 Message Routing

The three lsb's of the message type field indicate the message routing mechanism, which is
always '000' for error messages, indicating the message should be routed to the Root Complex.

2.5.4 Header Logging

As part of the Advanced Error Reporting capabilities the Nexsis switch logs the TLP
header for the first uncorrectable transaction layer error reported. Headers are logged
only if the mask bit for the corresponding error is not set in the Uncorrectable Error Mask
register and the Error Status bit pointed to by the First Error pointer is not
set. Headers are logged in a 4 DWORD register for the following errors.
• Poisoned TLP received
• ECRC Check Failed (error is not supported)
• Unsupported Request
• Completer Abort (error is not supported)
• Unexpected Completion
• Malformed TLP
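
The first-error capture rule can be modelled in a few lines. This is a toy model under stated assumptions, not the register implementation: the class and method names are hypothetical, the bit indices passed in are arbitrary, and the sketch assumes a masked error still sets its status bit.

```python
class HeaderLog:
    """Toy model of first-error header capture."""

    def __init__(self):
        self.uncorrectable_status = 0     # one bit per error type
        self.uncorrectable_mask = 0
        self.first_error_pointer = None   # bit index of the first logged error
        self.header_log = None            # four DWORDs once captured

    def report(self, error_bit: int, header_dwords) -> bool:
        """Record an uncorrectable error; return True if its header was logged."""
        first_still_pending = (
            self.first_error_pointer is not None
            and bool(self.uncorrectable_status & (1 << self.first_error_pointer))
        )
        self.uncorrectable_status |= 1 << error_bit
        if self.uncorrectable_mask & (1 << error_bit) or first_still_pending:
            return False
        self.first_error_pointer = error_bit
        self.header_log = tuple(header_dwords)
        return True

    def clear(self, error_bit: int) -> None:
        """Software clears a status bit after servicing the error."""
        self.uncorrectable_status &= ~(1 << error_bit)
```

Until software clears the status bit that the First Error pointer refers to, later errors update the status register but leave the logged header untouched.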



There are no variations in header logging logic for shared or non-shared ports.

2.5.5 Error Tables

The following table lists the errors that may occur and be detected on a single PCI
Express link.



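Layer | Error Name | Default Severity | Detecting Agent Action
Physical | Receiver Error | Correctable | Send ERR_COR to Root Complex
Physical | Training Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC
Data Link | Bad TLP | Correctable | Send ERR_COR to RC
Data Link | Bad DLLP | Correctable | Send ERR_COR to RC
Data Link | Replay Timeout | Correctable | Send ERR_COR to RC
Data Link | REPLAY_NUM Rollover | Correctable | Send ERR_COR to RC
Data Link | Data Link Layer Protocol Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC
Transaction | Poisoned TLP Received | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of TLP
Transaction | ECRC Check Failed | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of TLP
Transaction | Unsupported Request (UR) | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of TLP
Transaction | Completion Timeout | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC
Transaction | Completer Abort | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC
Transaction | Unexpected Completion | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; Log header of Completion
Transaction | Receiver Overflow | Uncorrectable (Fatal) | Send ERR_FATAL to RC
Transaction | Flow Control Protocol Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC
Transaction | Malformed TLP | Uncorrectable (Fatal) | Send ERR_FATAL to RC; Log header of TLP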



Table 2.5-4: PCI Express Link Errors 

Occurrence of any of the above correctable errors can be flagged in the Correctable Error
Status Register and masked in the Correctable Error Mask Register.

Occurrence of any of the above uncorrectable errors can be flagged in the Uncorrectable Error
Status Register and masked in the Uncorrectable Error Mask Register. Additionally, the
uncorrectable errors can be programmed to be reported as either a fatal or non-fatal error by use
of the Uncorrectable Error Severity Register. These registers are replicated for each OSD.

2.5.5.1.1 Error Signaling and Logging
Legend:

Type: C=correctable, NF=non-fatal, F=fatal - indicates type of error message that is generated

S=Supported N=Not Supported

Italicized errors correspond to specific error bit set

2.5.5.1.2 Rx Physical Layer Errors



Error | S/N | Packet Behavior | Reported Error | Type
Invalid symbol or disparity error | S | Discard pkt if error is part of pkt. Schedule Nak if TLP | Rx Error; Rx Port Error reported by PLM (does not report if it was part of a pkt or not) | C
Link Error | S | | Receiver Error | C
Training Error | S | N/A | Training Error |

Table 2.5-5: Rx Physical Layer Errors

2.5.5.1.3 Rx Data Link Layer Errors

Error | S/N | Packet Behavior | Reported Error | Type
Invalid sequence number on ACK or NAK DLLP | S | Discard DLLP | DLL Protocol Error | F
Duplicate Seq. Number on TLP | S | Discard TLP, schedule Ack | No error |
Unexpected Seq. Number on TLP | S | Discard TLP, send Nak | Bad TLP | C
LCRC error on TLP | S | Discard TLP unless it's cut-through, schedule Nak | Bad TLP | C
LCRC error on DLLP | | Discard DLLP | Bad DLLP | C
DLLP w/ unsupported type encodings | | Discard DLLP | No associated Error |
FC Init protocol violations | | N/A | DLL Protocol Error | F
Rx ... Violations | S | Discard pkt, send Nak | Bad TLP | C
TLP with EDB and inverted LCRC | ? | Discard pkt. No Nak scheduled | No associated Error |
Nullified TLP without EDB | S/N? | Discard TLP | Receiver Error | C

Table 2.5-6 Rx Data Link Layer Errors



Table 2.5-7 

2.5.5.1.4 Rx Transaction Layer Errors 



Error | Opt | Packet Behavior | Reported Error | Type
Unmapped address | S | Discard TLP | Unsupported Request | F
FC overflow | N | Discard pkt. No Nak, No FC update. | Receiver Overflow | F
Malformed TLP (1) | S | Discard TLP, No Nak, No FC update | Malformed TLP | F
Pkt_length field < actual pkt length | S | Truncate pkt (or discard if S&F). | Malformed TLP | F
Pkt_length field > actual pkt length | S | Stop at END. Discard if S&F. No Nak, No FC update | Malformed TLP | F
Unexpected Completion (2) | S | Discard completion | Unexpected Completion | NF
Request violates programming model of Rx device | N | Discard request | Completer Abort | NF
Rx device unable to process request due to device-specific error condition | N | Discard request | Completer Abort | NF
Advertising more than 2048 credits for payload and 128 for header | S/N? | | Flow Control Protocol Error | F
Did not advertise FC credit values >= min defined in Table 2-27 of PCI Express spec | S/N? | | Flow Control Protocol Error | F
Non-zero credit values received after infinite credits advertised | S/N? | | Flow Control Protocol Error | F
... state | S/N? | Return completion as Unsupported Request, discard request | Unsupported Request | F
Received poisoned TLP | S | Pass TLP thru, unless directed to switch, then discard (and return UR for non-posted requests) | Poisoned TLP Received | NF

1 See the conditions for malformed TLPs under Malformed TLP above.
2 Nexsis Switch will never expect completions, so can just ignore them and let them pass
thru switch.

Table 2.5-8 Rx Transaction Layer Errors

2.5.5.1.5 Tx Data Link Layer Errors



Error | S/N | Packet Behavior | Reported Error | Type
REPLAY_NUM rolls over | S | Retrain the link | REPLAY_NUM Rollover | C
REPLAY_TIMER expires | S | Retry entire buffer | Replay Timeout | C



Table 2.5-9 Tx Data Link Layer Errors 
2.5.5.1.6 Tx Transaction Layer Errors 




Error | Opt | Packet Behavior | Reported Error | Type
Invalid TC, OSD or type | | Nak the transaction coming from the Switch Core. | None. |
TLP length exceeds Max Payload Size | | Discard TLP (for TLPs with data payload only) | |
Actual packet length is greater than pkt length field. | | Truncate and nullify packet? | Malformed TLP |
Actual packet length is less than pkt length field | | Nullify packet | Malformed TLP |
Completion Timeout (1) | | | Completion Timeout | NF

1 The Nexsis Switch will not issue any Requests requiring Completions, so no timeout should
ever be enabled.

Table 2.5-10 Tx Transaction Layer Errors



2.6 QoS

There are two components to Quality of Service in our switch. First, the Traffic
Class (TC) field in the transaction allows the driver to differentiate certain transaction
flows and have them be mapped into Virtual Channels (VCs). So a particular message
might be labeled with a TC of 7 for high-priority, while standard memory reads and
writes might be labeled with a TC of 0. Our switch supports all 8 VCs. Second, our
switch allows different OS Domains to share an I/O port, and each OSD will
automatically receive its fair share of the bandwidth when that port is congested.



2.6.1 TC/VC Mapping 

The PCI Express spec lists a 3-bit Traffic Class (TC) field that is present in the
transaction header. This field is used to differentiate traffic so that it can be prioritized,
queued, and scheduled independently inside the various PCI Express devices. Although
the spec says that anything other than TC0 is optional, our switch will accommodate all 8
traffic classes.

TCs are mapped onto Virtual Channels (VCs) in a vendor-defined manner according to
the PCI Express spec. For our switch, each incoming TLP will have its TC field mapped
into a VC based on software configuration settings. TC0 must always be mapped to
VC0, but all other TCs may be mapped to VCs without restriction individually on a
per-port basis.
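
A per-port TC-to-VC map and the one constraint placed on it can be sketched as below. The function names and the list representation are assumptions made for the example; only the 8 traffic classes, the 8 VCs, and the TC0-to-VC0 rule come from the text.

```python
def validate_tc_vc_map(tc_to_vc):
    """Check a per-port TC-to-VC map: 8 entries, each VC in 0..7, TC0 -> VC0."""
    if len(tc_to_vc) != 8:
        raise ValueError("need one VC for each of the 8 traffic classes")
    if any(not 0 <= vc <= 7 for vc in tc_to_vc):
        raise ValueError("VC numbers must be in 0..7")
    if tc_to_vc[0] != 0:
        raise ValueError("TC0 must always be mapped to VC0")
    return list(tc_to_vc)


def vc_for_tlp(tc_to_vc, tc: int) -> int:
    """Look up the VC for an incoming TLP's 3-bit TC field."""
    return tc_to_vc[tc]
```

For instance, the map `[0, 0, 0, 0, 1, 1, 1, 2]` sends TC0 to TC3 onto VC0, TC4 to TC6 onto VC1, and TC7 onto VC2.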




Each RX MAC will have a TC mapping table that's just a flat-mapped lookup table that
turns an OSD and TC into a Source QID. The Source QID is needed to know which of
the 16 flow control resources were charged with the credits of the incoming TLP.
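
A flat-mapped table of this kind might look like the sketch below. The class name and the index layout (OSD in the upper bits, TC in the lower three) are assumptions for illustration; the sizes follow the figures quoted in this document (16 OS Domains, a 3-bit TC field, 16 flow control resources).

```python
NUM_OSDS = 16   # OS Domains supported per port
NUM_TCS = 8     # 3-bit Traffic Class field
NUM_QIDS = 16   # flow control resources


class SourceQidTable:
    """Flat lookup table: (OSD, TC) -> Source QID."""

    def __init__(self):
        self._table = [0] * (NUM_OSDS * NUM_TCS)

    @staticmethod
    def _index(osd: int, tc: int) -> int:
        if not (0 <= osd < NUM_OSDS and 0 <= tc < NUM_TCS):
            raise IndexError("OSD or TC out of range")
        return osd * NUM_TCS + tc       # assumed layout: {osd, tc} concatenated

    def program(self, osd: int, tc: int, qid: int) -> None:
        if not 0 <= qid < NUM_QIDS:
            raise ValueError("QID out of range")
        self._table[self._index(osd, tc)] = qid

    def lookup(self, osd: int, tc: int) -> int:
        return self._table[self._index(osd, tc)]
```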




Once the destination port and destination QID are returned from the
address_lookup_module, the transaction can actually be queued up now that all the data
for the transaction is known.

2.6.2 Arbitration points 

There are three arbitration points inside our switch that will be supported, two in the
transaction scheduler and one in the data mover. In the transaction_scheduler, there are
17 sets of arbiters that run independently, one set per output port. Each set of arbiters
will ensure each input port is allowed its fair share of the output port's bandwidth, and
each OSD is allowed its fair share. These levels of arbitration will be
discussed in the next few sections.

2.6.2.1.1 Port arbitration (transaction_scheduler)

Port arbitration is the first stage within each port_arbiter at an output port (shown as
ARB1 in Figure 2.6-1). This level of arbitration will use a simple RR
scheme to make sure each input is served and the bandwidths are balanced. This RR
scheme is fixed in hardware; no WRR is supported.
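
A simple (unweighted) round-robin arbiter of the kind used at these arbitration points can be sketched as follows. The class is a generic illustration with hypothetical names, not the port_arbiter logic itself.

```python
class RoundRobinArbiter:
    """Unweighted round-robin: the search starts just after the last winner."""

    def __init__(self, num_requesters: int):
        self.n = num_requesters
        self.last = self.n - 1          # so requester 0 has priority first

    def grant(self, requests):
        """requests: sequence of bools, one per requester.
        Return the granted index, or None if nobody is requesting."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None
```

Because the search always resumes after the most recent winner, a requester that keeps asserting its request is served at least once in every `n` grants.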



2.6.2.1.2 Second-level arbitration (transaction_scheduler)

Since there are multiple buffer groups for each set of arbiters, a second level of RR
arbitration (ARB2) will pick which transaction will be selected as the next one to be
transmitted on a given output port. This RR scheme is fixed in hardware; no WRR is
supported.

2.6.2.1.3 Input arbitration (data_mover)

The data_mover is the final stage (ARB3) that selects which input port is actually
allowed to transfer data to each output port. Another level of RR is run to make sure
each input port is serviced when more than one input port has data to send to a given
output.

Note that the data_mover will skip over the "best" choice from the transaction_scheduler
if a different input port that is idle is able to move its data instead. In this case, a skip
flag will be set such that the data_mover will not continue skipping the "best" choice
more than once. Without this skip flag, one output could get very unlucky and get
skipped over and over, stalling a transaction potentially indefinitely under certain traffic
loads.
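
One possible reading of the skip-flag rule is sketched below. It assumes the "best" choice is skipped only while its input port is busy, and that the flag is cleared once the "best" choice is finally moved; both points are interpretations, and the class and argument names are hypothetical.

```python
class DataMoverSelector:
    """Per-output selection with a one-shot skip flag."""

    def __init__(self):
        self.skipped_once = False

    def select(self, best, best_input_busy: bool, idle_alternative=None):
        """Return the transaction to move now, or None to wait for `best`."""
        if not best_input_busy:
            self.skipped_once = False    # `best` is served; re-arm the flag
            return best
        if idle_alternative is not None and not self.skipped_once:
            self.skipped_once = True     # allow exactly one skip
            return idle_alternative
        return None                      # never skip `best` a second time
```

With the flag in place, the scheduler's preferred transaction can be bypassed at most once before the data_mover waits for it.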

The following picture shows these levels of arbitration. 



transaction scheduler 




Figure 2.6-1: Transaction Scheduler Block Diagram

2.7 System Level Management Examples
2.7.2 Hot Plug Add of I/O Device 

Our device will integrate a hot-plug controller, allowing system software and/or chassis 
management software access to several registers to control hot-plug functions. These 
registers are implemented in the PCI-Express capabilities structure and can be used to 
turn on and off LEDs, exchange information with the Power Controller in the chassis, and 
report the status of the various slots in our system. 

Once a link has been trained, OSDs have been negotiated (if necessary), and FC
initialization is complete, the hot plug process can begin.

For hot insertion, this process will start with the user pressing the Attention Button, which
will set a register bit in our switch for those P2P bridge headers. The COP will send an
MSI up to the hot plug services software running on each OS for each OSD that shares the
newly-plugged-in I/O device. The hot plug services software will begin device
discovery.

2.7.2.1.1 Hot-plug messages

There are 7 different hot-plug messages defined in PCI-Express. Every message of this
type will be routed to the COP for processing (address_lookup_module will match the
bridge header space) if the message is addressed to one of the P2P bridge headers in our
switch. These are only required if the LEDs are implemented on the downstream I/O
device. If the LEDs are driven directly by the switch (using GPIO), these messages won't be
used.

2.7.2.1.2 Attention_Button_Pressed

The Attention Button Pressed register bit must be set to a 1 if our switch receives this
message from a downstream port. If our switch detects the attention button was pressed
for a given slot (our switch has one slot for each port), our switch will generate an
MSI and send it up to the hotplug services software.

2.7.2.1.3 Attention_Indicator_On, Attention_Indicator_Blink,
Attention_Indicator_Off

These three messages will be generated by the COP and sent downstream depending on
the state of the attention_indicator bit.

2.7.2.1.4 Power_Indicator_On, Power_Indicator_Blink, Power_Indicator_Off

These three messages will be generated by the COP and sent downstream depending on 
the state of the power_indicator bit. 
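
The mapping from indicator state to the message the COP sends downstream is mechanical and can be sketched as below. The helper name and the string representation of the states are assumptions for the example; the seven message names are the ones listed in this section.

```python
INDICATOR_STATES = ("On", "Blink", "Off")


def indicator_message(indicator: str, state: str) -> str:
    """Name of the hot-plug message for an indicator ('Attention' or
    'Power') entering the given state ('On', 'Blink' or 'Off')."""
    if indicator not in ("Attention", "Power"):
        raise ValueError("unknown indicator")
    if state not in INDICATOR_STATES:
        raise ValueError("unknown indicator state")
    return f"{indicator}_Indicator_{state}"


# The seven hot-plug messages: one button message plus six indicator messages.
HOT_PLUG_MESSAGES = ["Attention_Button_Pressed"] + [
    indicator_message(i, s)
    for i in ("Attention", "Power")
    for s in INDICATOR_STATES
]
```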
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Blade 0 Blade 1 



Blade 2 



Blade 3 Blade 4 



Blade 5 



Blade 6 Blade 7 



Figure 2.7-1: Server Chassis Example





Busl9 



Bus20 



Figure 2.7-2: Logical Bus View 
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